A summary and introduction of 15 RAG chunking strategies.

This article provides a comprehensive analysis of 15 text chunking strategies, offering an in-depth understanding of each strategy's advantages and limitations.
Core content:
1. The ease of implementation and potential problems of fixed-size chunking
2. The advantages and challenges of sentence-based chunking in preserving semantic units
3. The role and limitations of paragraph-based chunking in processing structured documents
In the previous article, we introduced 5 common chunking strategies. In this article, we present a summary of 15 chunking strategies.
01. Fixed-Size Chunking
Fixed-size chunking splits the document into chunks of a predefined size, usually by word count, token count, or character count.
Use it when you need a straightforward approach and the document structure is not important. It works well with smaller, less complex documents.
Advantages:
Easy to implement.
Consistent chunk sizes.
Quick calculations.
Disadvantages:
Sentences or paragraphs may be disconnected, thus losing context.
This is not ideal for documents where preserving meaning is important.
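As with the strategies below, a minimal code sketch helps; this one splits by word count and assumes the same sample_text variable used throughout the article:

def fixed_size_chunk(text, chunk_size=100):
    # Split on whitespace and group every chunk_size words into one chunk
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Applying fixed-size chunking
fixed_chunks = fixed_size_chunk(sample_text)
for chunk in fixed_chunks:
    print(chunk, '\n---\n')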
02. Sentence-Based Chunking
This method chunks the text at natural sentence boundaries. Each chunk contains a certain number of sentences, preserving semantic units. Use it when maintaining coherent thoughts is crucial and a mid-sentence split would lose meaning.
Advantages:
Preserve sentence-level meaning.
Better context retention.
Disadvantages:
Chunk sizes are not uniform because sentences vary in length.
Very long sentences may exceed the model's token limit.
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunk(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

# Applying sentence-based chunking
sentence_chunks = sentence_chunk(sample_text)
for chunk in sentence_chunks:
    print(chunk, '\n---\n')
03. Paragraph-Based Chunking
This strategy splits the text based on paragraph boundaries, treating each paragraph as a block. It is best suited for structured documents such as reports or essays, where each paragraph contains a complete idea or argument.
Advantages:
Natural document segmentation.
Keep the larger context within your paragraphs.
Disadvantages:
The paragraphs are of different lengths, resulting in uneven block sizes.
Long paragraphs may still exceed the token limit.
def paragraph_chunk(text):
    paragraphs = text.split('\n\n')
    return paragraphs

# Applying paragraph-based chunking
paragraph_chunks = paragraph_chunk(sample_text)
for chunk in paragraph_chunks:
    print(chunk, '\n---\n')
04. Semantic-Based Chunking
This approach uses machine learning models (such as transformers) to split text into chunks based on semantic meaning. It is useful when retaining the highest level of context is critical, such as in complex technical documents.
Advantages:
Produce chunks with contextual meaning.
Capture semantic relationships between sentences.
Disadvantages:
Requires advanced NLP models, which are computationally expensive.
More complex to implement.
def semantic_chunk(text, max_len=200):
    # Simplified approximation: group consecutive sentences until the chunk
    # exceeds max_len characters, rather than splitting on model embeddings
    doc = nlp(text)
    chunks = []
    current_chunk = []
    for sent in doc.sents:
        current_chunk.append(sent.text)
        if len(' '.join(current_chunk)) > max_len:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Applying semantic-based chunking
semantic_chunks = semantic_chunk(sample_text)
for chunk in semantic_chunks:
    print(chunk, '\n---\n')
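Note that the snippet above only approximates semantic chunking with a character budget. A hedged sketch of a genuinely embedding-based variant, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model are available, splits wherever the cosine similarity between adjacent sentence embeddings drops below a threshold:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_semantic_chunk(sentences, threshold=0.5):
    # Start a new chunk when adjacent sentences are semantically dissimilar
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if similarity < threshold:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
        current_chunk.append(sentences[i])
    chunks.append(' '.join(current_chunk))
    return chunks

# Reusing sentence_chunk from earlier to produce the sentence list
embedding_chunks = embedding_semantic_chunk(sentence_chunk(sample_text))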
05. Multimodal Chunking
This strategy treats different content types (text, images, tables) separately. Each modality is chunked independently based on its characteristics. Suitable for documents containing various content types, such as PDFs or technical manuals with mixed media.
Advantages:
Tailor-made for mixed media documents.
Allows custom handling of different modalities.
Disadvantages:
Complex to implement and manage.
Each modality requires different processing logic.
def modality_chunk(text, images=None, tables=None):
    # This function assumes you have pre-processed text, images, and tables
    text_chunks = paragraph_chunk(text)
    return {'text_chunks': text_chunks, 'images': images, 'tables': tables}

# Applying modality-specific chunking
modality_chunks = modality_chunk(sample_text, images=['img1.png'], tables=['table1'])
print(modality_chunks)
06. Sliding Window Chunking
Sliding window chunking creates overlapping chunks, allowing each chunk to share part of its content with the next. Use it when you need to ensure continuity of context between chunks, such as in legal or academic documents.
Advantages:
Preserve context across chunks.
Reduce information loss at chunk boundaries.
Disadvantages:
May introduce redundancy by repeating content in multiple chunks.
More processing is needed.
def sliding_window_chunk(text, chunk_size=100, overlap=20):
    tokens = text.split()
    chunks = []
    # Step forward by chunk_size - overlap so consecutive chunks share tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Applying sliding window chunking
sliding_chunks = sliding_window_chunk(sample_text)
for chunk in sliding_chunks:
    print(chunk, '\n---\n')
07. Hierarchical Chunking
Hierarchical chunking divides a document at multiple levels, such as sections, subsections, and paragraphs. Use it for highly structured documents, such as academic papers or legal texts, where maintaining the hierarchical structure is essential.
Advantages:
Preserve document structure.
Maintain context at multiple levels of granularity.
Disadvantages:
It is more complicated to implement.
May result in uneven chunk sizes.
def hierarchical_chunk(text, section_keywords):
    sections = []
    current_section = []
    for line in text.splitlines():
        # Start a new section whenever a line contains a section keyword
        if any(keyword in line for keyword in section_keywords):
            if current_section:
                sections.append("\n".join(current_section))
            current_section = [line]
        else:
            current_section.append(line)
    if current_section:
        sections.append("\n".join(current_section))
    return sections

# Applying hierarchical chunking
section_keywords = ["Introduction", "Overview", "Methods", "Conclusion"]
hierarchical_chunks = hierarchical_chunk(sample_text, section_keywords)
for chunk in hierarchical_chunks:
    print(chunk, '\n---\n')
08. Content-Aware Chunking
This approach adapts the chunking to content characteristics (e.g., chunking text at the paragraph level while treating tables as separate entities). Use it for documents with heterogeneous content, such as e-books or technical manuals, where chunking must vary by content type.
Advantages:
Flexible and adaptable to different content types.
Maintain document integrity across multiple formats.
Disadvantages:
Requires complex dynamic chunking logic.
Difficult to implement for documents with diverse content structures.
def content_aware_chunk(text):
    chunks = []
    current_chunk = []
    for line in text.splitlines():
        # Treat markdown headings and key section titles as chunk boundaries
        if line.startswith(('##', '###', 'Introduction', 'Conclusion')):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

# Applying content-aware chunking
content_chunks = content_aware_chunk(sample_text)
for chunk in content_chunks:
    print(chunk, '\n---\n')
09. Table-Aware Chunking
This strategy handles document tables specifically, extracting them into separate chunks and converting them to formats such as markdown or JSON for easier processing. Use it for documents where tables carry important information, such as financial reports or technical documents.
Advantages:
Preserve table structure for efficient downstream processing.
Allows independent processing of tabular data.
Disadvantages:
During the conversion process, formatting may be lost.
Tables with complex structures require special handling.
import pandas as pd

def table_aware_chunk(table):
    # Requires the tabulate package for DataFrame.to_markdown()
    return table.to_markdown()

# Sample table data
table = pd.DataFrame({
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, 30, 22],
    "Occupation": ["Engineer", "Doctor", "Artist"]
})

# Applying table-aware chunking
table_markdown = table_aware_chunk(table)
print(table_markdown)
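If JSON is preferred over markdown, the same pandas table can be serialized with table.to_json(orient="records"), which emits one JSON object per row.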
10. Token-Based Chunking
Token-based chunking splits text based on a fixed number of tokens rather than words or sentences, using a tokenizer from an NLP model. Use it for models that operate on tokens, such as transformer-based models with token limits (e.g., GPT-3 or GPT-4).
Advantages:
Applicable to transformer-based models.
Ensures adherence to token limits.
Disadvantages:
Tokenization may split sentences or disrupt context.
Does not always align with natural language boundaries.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def token_based_chunk(text, max_tokens=200):
    tokens = tokenizer(text)["input_ids"]
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return [tokenizer.decode(chunk) for chunk in chunks]

# Applying token-based chunking
token_chunks = token_based_chunk(sample_text)
for chunk in token_chunks:
    print(chunk, '\n---\n')
11. Entity-Based Chunking
Entity-based chunking leverages Named Entity Recognition (NER) to divide text into chunks based on recognized entities, such as people, organizations, or locations. This is useful for documents where specific entities must be maintained as contextual units, such as resumes, contracts, or legal documents.
Advantages:
Keep named entities unchanged.
Retrieval accuracy can be improved by focusing on related entities.
Disadvantages:
A trained NER model is required.
Entities may overlap, resulting in complex chunk boundaries.
def entity_based_chunk(text):
    # Note: this simplified version extracts the recognized entities
    # themselves rather than splitting the surrounding text into chunks
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

# Applying entity-based chunking
entity_chunks = entity_based_chunk(sample_text)
print(entity_chunks)
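Since the snippet above returns only the entities, here is one possible sketch of forming actual chunks around them, reusing the spaCy nlp pipeline loaded earlier (entity_anchored_chunk is an illustrative helper, not a standard function): start a new chunk whenever a sentence introduces entities not yet seen in the current chunk.

def entity_anchored_chunk(text):
    doc = nlp(text)
    chunks, current_chunk, seen_entities = [], [], set()
    for sent in doc.sents:
        sent_entities = {ent.text for ent in sent.ents}
        # Close the current chunk if this sentence mentions only new entities
        if current_chunk and sent_entities and not (sent_entities & seen_entities):
            chunks.append(' '.join(current_chunk))
            current_chunk, seen_entities = [], set()
        current_chunk.append(sent.text)
        seen_entities |= sent_entities
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

entity_chunks = entity_anchored_chunk(sample_text)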
12. Topic-Based Chunking
This method uses topic modeling to split text where the subject shifts, so that each chunk covers a single theme. Use it for documents that cover multiple topics, such as news articles, research papers, or reports with different themes; a code sketch follows the list below.
Advantages:
Group related information together.
Helps to focus searches based on specific topics.
Disadvantages:
Additional processing (topic modeling) is required.
For short documents or overlapping topics, this may not be exact.
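A minimal sketch of this idea, assuming scikit-learn is installed (topic_based_chunk is an illustrative helper): assign each sentence a dominant LDA topic, then group consecutive sentences that share it.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_based_chunk(sentences, n_topics=3):
    if not sentences:
        return []
    # Fit a small LDA model over the sentences
    vectorizer = CountVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    dominant_topics = lda.fit_transform(matrix).argmax(axis=1)
    # Group consecutive sentences that share a dominant topic
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        if dominant_topics[i] != dominant_topics[i - 1]:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
        current_chunk.append(sentences[i])
    chunks.append(' '.join(current_chunk))
    return chunks

# Reusing sentence_chunk from earlier to produce the sentence list
topic_chunks = topic_based_chunk(sentence_chunk(sample_text))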
13. Page-Based Chunking
This strategy splits the document at page boundaries, treating each page as a chunk. Use it for documents where page boundaries carry semantic significance, such as PDFs or printable reports.
Advantages:
Easy to implement for PDF documents.
Respect page boundaries.
Disadvantages:
Pages may not correspond to natural text delimiters.
Context between pages may be lost.
def page_based_chunk(pages):
    # Split based on a pre-processed page list (simulating PDF page text)
    return pages

# Sample pages
pages = ["Page 1 content", "Page 2 content", "Page 3 content"]

# Applying page-based chunking
page_chunks = page_based_chunk(pages)
for chunk in page_chunks:
    print(chunk, '\n---\n')
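In practice, the pages list would come from a PDF library; a hedged example with the pypdf package ("report.pdf" is a hypothetical file name):

from pypdf import PdfReader

# Extract one text string per page of the PDF
pages = [page.extract_text() for page in PdfReader("report.pdf").pages]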
14. Keyword-Based Chunking
This method chunks documents based on predefined keywords or phrases that indicate a transition in topic (e.g., "Introduction", "Conclusion"). It is suitable for documents that follow a clear structure, such as scientific papers or technical specifications.
Advantages:
Capture natural topic separators based on keywords.
Suitable for structured documents.
Disadvantages:
Requires a predefined set of keywords.
Not suitable for unstructured text.
def keyword_based_chunk(text, keywords):
    chunks = []
    current_chunk = []
    for line in text.splitlines():
        if any(keyword in line for keyword in keywords):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

# Applying keyword-based chunking
keywords = ["Introduction", "Overview", "Conclusion", "Methods", "Challenges"]
keyword_chunks = keyword_based_chunk(sample_text, keywords)
for chunk in keyword_chunks:
    print(chunk, '\n---\n')
15. Hybrid Chunking
Hybrid chunking combines multiple strategies based on content type and document structure. For example, text can be chunked by sentence while tables and images are handled separately. Use it for complex documents containing a variety of content types, such as technical reports, business documents, or product manuals.
Advantages:
Highly adaptable to various document structures.
Allows fine-grained control over different content types.
Disadvantages:
It is more complicated to implement.
Custom logic is required to handle each content type.
def hybrid_chunk(text):
    # First split into paragraphs, then split each paragraph into sentences
    paragraphs = paragraph_chunk(text)
    hybrid_chunks = []
    for paragraph in paragraphs:
        hybrid_chunks += sentence_chunk(paragraph)
    return hybrid_chunks

# Applying hybrid chunking
hybrid_chunks = hybrid_chunk(sample_text)
for chunk in hybrid_chunks:
    print(chunk, '\n---\n')
When building Retrieval-Augmented Generation (RAG) systems, it is critical to optimize chunking for the specific use case and document type, since different scenarios have different requirements based on document size, content diversity, and retrieval speed.
Choosing the right chunking strategy depends on multiple factors, including document type, the need to preserve context, and the balance between retrieval speed and accuracy. Whether you are dealing with academic papers, legal documents, or mixed content files, choosing the right approach can significantly improve the effectiveness of your RAG. By iterating and optimizing your chunking approach, you can adapt to changing document types and user needs, ensuring your retrieval system remains robust and efficient.