RAG implementation: a complete analysis of 4 strategies for text segmentation

Applying RAG to text processing improves both retrieval efficiency and generation quality.
Core content:
1. The RAG pipeline and why text segmentation matters
2. The three elements of a segmentation strategy, with a comparison of strategies
3. Recommended strategy combinations and a detailed introduction to each splitter
1. What is RAG
RAG (Retrieval-Augmented Generation) is an AI technique that combines retrieval with generation. Its pipeline usually includes the following stages:
1. Text splitting
2. Vectorized encoding (embedding)
3. Storage in a vector database (such as FAISS, Chroma, or Milvus)
4. Retrieval of similar passages
5. Answer generation
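The five stages above can be sketched end to end in plain Python. This is a toy illustration only: the bag-of-words "embedding" and the in-memory list standing in for a vector store are invented stand-ins for a real embedding model and FAISS/Chroma/Milvus.

```python
import math
from collections import Counter

# Stage 1: naive fixed-size splitting
def split_text(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Stage 2: toy bag-of-words "embedding" (a real system uses a neural model)
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ("RAG combines retrieval and generation. "
        "Chunks are embedded and stored. "
        "Similar passages are retrieved at query time.")

# Stage 3: "store" each chunk with its vector (in place of FAISS/Chroma/Milvus)
store = [(chunk, embed(chunk)) for chunk in split_text(docs, chunk_size=40)]

# Stage 4: retrieve the chunk most similar to the query
query_vec = embed("how are passages retrieved")
best_chunk, _ = max(store, key=lambda pair: cosine(query_vec, pair[1]))
print(best_chunk)  # Stage 5 would pass this context to an LLM to generate the answer
```

Every real RAG stack swaps each toy stage for a production component, but the data flow stays exactly this shape.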
2. Why do we need to segment text?
Large models (such as GPT) cannot retrieve an entire document directly. We must first divide the document into chunks of appropriate size and then embed each chunk. If the chunks are too fine, context is lost; if they are too coarse, the embeddings become imprecise or the chunk exceeds the context window.
Text segmentation is critical in RAG, because improper segmentation may:
- Break semantic units (e.g. cut a sentence in half)
- Reduce recall accuracy
- Leave the LLM with incomplete context, degrading its output
The goals of text segmentation are:
- Divide large documents into small chunks so they can be embedded and retrieved efficiently.
- Segment at sensible boundaries so that retrieval is more accurate.
3. Text segmentation strategy
The three elements of chunking are chunk size (chunk_size), chunk overlap (chunk_overlap), and the separator (separator); these three parameters recur in every splitter below.
Comparison of chunking strategies:

| strategy | how it splits | typical use |
|---|---|---|
| SentenceSplitter | complete sentences, up to a fixed chunk size | simple documents |
| TokenTextSplitter | a fixed number of tokens | few scenarios today |
| SentenceWindowNodeParser | one sentence per node, plus a context window | high context requirements |
| SemanticSplitterNodeParser | semantic similarity between sentences | high-quality QA |

LlamaIndex provides multiple built-in TextSplitter classes to handle documents of different languages and structures. The commonly used splitters are SentenceSplitter, TokenTextSplitter, SentenceWindowNodeParser, and SemanticSplitterNodeParser.
4. Recommended segmentation strategy combinations
- Simple documents: SentenceSplitter with a fixed chunk size
- High context requirements: SentenceWindowNodeParser, whose sentence window keeps the context continuous
- High-quality QA: SemanticSplitterNodeParser for semantic segmentation (requires installing an additional small embedding model)
5. Detailed introduction to the splitters
1. Sentence Splitter
When parsing text, it prioritizes complete sentences: this class tries to keep sentences and paragraphs together.
Parameters:
| name | type | description | default |
|---|---|---|---|
| chunk_size | int | The token size of each chunk. | 1024 |
| chunk_overlap | int | The number of tokens that adjacent chunks overlap. | 200 |
| separator | str | The default word separator. | ' ' |
| paragraph_separator | str | The separator between paragraphs. | '\n\n\n' |
| secondary_chunking_regex | str | Backup regular expression for splitting sentences. | '[^,.;。？！]+[,.;。？！]?' |

separator is very important: set it to include the common sentence-ending punctuation of every language that appears in your documents.
Install dependencies:

```shell
pip install llama-index llama-index-embeddings-huggingface
```
split_text_demo.py:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# 1. Load documents
documents = SimpleDirectoryReader(
    input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]
).load_data()
# print(f"Original number of documents: {len(documents)}")

# 2. Create the SentenceSplitter
sentence_splitter = SentenceSplitter(
    chunk_size=100,
    chunk_overlap=10,
    separator="。！？.\n¡¿",  # separators covering Chinese, English, and Spanish
)

# 3. Split the documents
nodes = sentence_splitter.get_nodes_from_documents(documents)
print(f"Number of generated nodes: {len(nodes)}")
print("Lengths of the first 3 chunks:", [len(n.text) for n in nodes[:3]])
print("\nSegmentation result example:")
for i, node in enumerate(nodes[:3]):
    print(f"\nChunk {i + 1}:\n{node}")
```
About chunk_overlap
chunk_overlap is the number of characters or tokens that two adjacent chunks share.
What it does:
- Preserves contextual continuity (avoids fragmenting important information)
- Improves retrieval and generation quality (e.g. in multi-turn question answering)
Sometimes adjacent chunks do not overlap even when the value is set, because sentence boundaries take priority during splitting: the text is not cut into strictly fixed-length blocks by character count. So chunk_overlap means "overlap whole sentences as much as possible" rather than a precise character-level overlap.
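Sentence-boundary behavior aside, the basic mechanics of overlap are easy to see with a character-level toy splitter. This is pure Python, not LlamaIndex: a window of chunk_size characters advances by chunk_size − chunk_overlap each step, so adjacent chunks share exactly chunk_overlap characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a chunk_size-character window, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Here each chunk repeats the last 3 characters of its predecessor; SentenceSplitter applies the same idea at the token level, but snaps the cut points to sentence boundaries.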
2. Fixed-size segmentation
TokenTextSplitter splits by a fixed number of tokens; it is needed in relatively few scenarios today.
Parameters:

| name | type | description | default |
|---|---|---|---|
| chunk_size | int | The token size of each chunk. | 1024 |
| chunk_overlap | int | The number of tokens that adjacent chunks overlap. | 20 |
| separator | str | The default word separator. | ' ' |
| backup_separators | List | Additional separators to use for splitting. | <dynamic> |
| keep_whitespaces | bool | Whether to preserve leading/trailing whitespace in chunks. | False |
token_text_splitter_demo.py:

```python
# Fixed-size (token-based) segmentation
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader(
    input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]
).load_data()

fixed_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
fixed_nodes = fixed_splitter.get_nodes_from_documents(documents)

print("Fixed-chunk lengths:", [len(n.text) for n in fixed_nodes[:3]])
print("First node content:\n", fixed_nodes[0].text)
print("===========")
print("Second node content:\n", fixed_nodes[1].text)
```
3. Sentence Window Segmentation
SentenceWindowNodeParser is an advanced text parser provided by LlamaIndex, designed for RAG (Retrieval-Augmented Generation) scenarios. Its core function is to split the document into sentences and attach contextual information of several sentences before and after each sentence node , thereby providing richer semantic background in the retrieval and generation stages.
Core features:
- Sentence-level segmentation: the document is split into independent sentences, each becoming a node.
- Context window: the contents of several sentences before and after each sentence are attached to the node, forming a "sentence window" that provides context.
- Metadata storage: the window text is stored in the node's metadata for use in the subsequent retrieval and generation stages.
Parameters:

| name | type | description | default |
|---|---|---|---|
| sentence_splitter | Optional[Callable] | Function used to split text into sentences. | split_by_sentence_tokenizer |
| include_metadata | bool | Whether to include metadata in the node. | True |
| include_prev_next_rel | bool | Whether to include previous/next node relationships. | True |
| window_size | int | The number of sentences captured on each side of the current sentence. | 3 |
| window_metadata_key | str | The metadata key used to store the window. | 'window' |
| original_text_metadata_key | str | The metadata key used to store the original sentence. | 'original_text' |
```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

text = "hello. how are you? I am fine! aaa;ee. bb,cc"

# Define the sentence-window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

nodes = node_parser.get_nodes_from_documents([Document(text=text)])
print([x.text for x in nodes])
print("-" * 20)
print("Metadata of the first node:", nodes[0].metadata)
print("-" * 20)
print("Metadata of the last node:", nodes[-1].metadata)
```
Result analysis:
The English string "hello. how are you? I am fine! aaa;ee. bb,cc" is split into five sentences: SentenceWindowNodeParser recognizes sentence boundaries by the sentence-ending punctuation marks period (.), question mark (?), and exclamation mark (!).
When the document is split, both the window text and the original sentence are stored in each node's metadata, under the custom window_metadata_key and original_text_metadata_key.
Looking at the metadata of the first node: since it holds the first sentence of the document, its window contains only the current sentence plus the following sentences that fall inside the window.
The last node, holding the final sentence, has a window containing only the preceding sentences inside the window plus the current sentence.
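The window construction described above can be approximated in a few lines of plain Python. This is an illustration of the idea, not LlamaIndex's actual implementation; the naive punctuation-based sentence splitter and the assumption that the window spans up to window_size sentences on each side are both simplifications for the example.

```python
import re

def build_sentence_windows(text: str, window_size: int = 3) -> list[dict]:
    # Naive boundary detection on . ? !  (real parsers use a sentence tokenizer)
    sentences = [s.strip() for s in re.findall(r"[^.?!]+[.?!]?", text) if s.strip()]
    nodes = []
    for i, sent in enumerate(sentences):
        # Up to window_size sentences on each side of the current one
        window = sentences[max(0, i - window_size): i + window_size + 1]
        nodes.append({"original_text": sent, "window": " ".join(window)})
    return nodes

nodes = build_sentence_windows("hello. how are you? I am fine! aaa;ee. bb,cc")
print(len(nodes))          # one node per sentence
print(nodes[0]["window"])  # truncated at the start of the document
print(nodes[-1]["window"]) # truncated at the end of the document
```

Nodes near the document boundaries simply get smaller windows, which matches the asymmetric metadata observed for the first and last nodes above.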
Note:
The SentenceWindowNodeParser in LlamaIndex only recognizes half-width (ASCII) sentence-ending punctuation, so it cannot segment Chinese documents, which use full-width punctuation.
Workaround: replace the full-width punctuation marks (。！？) in the Chinese document with the corresponding half-width marks, each followed by an extra space; the Chinese text can then be segmented.
```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# Text using full-width sentence-ending punctuation, as in a Chinese document
text = "Hello, nice to meet you。It's already 10 o'clock, but I don't want to get up！It's snowing！Have you finished your homework？"

# Replace each full-width mark with its half-width equivalent plus a space
text = text.replace("。", ". ")
text = text.replace("！", "! ")
text = text.replace("？", "? ")

# Define the sentence-window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

nodes = node_parser.get_nodes_from_documents([Document(text=text)])
print([x.text for x in nodes])
print("-" * 20)
print("Metadata of the first node:", nodes[0].metadata)
print("-" * 20)
print("Metadata of the second-to-last node:", nodes[2].metadata)
```
4. Semantic splitter
Parameter description
The role of buffer_size:
SemanticSplitterNodeParser divides a document into nodes according to semantic similarity, so that each node contains a group of semantically related sentences. In this process, buffer_size determines how many sentences the model considers together each time it computes semantic similarity.
For example, buffer_size=3 means the model evaluates the semantic similarity of 3 consecutive sentences as one unit, which helps decide whether a breakpoint should be inserted between these sentences to start a new node.
Parameter-setting suggestions:
- Smaller buffer_size (such as 1 or 2): suitable for documents with frequent topic changes or a loose structure; captures semantic shifts more finely.
- Larger buffer_size (such as 5 or 10): suitable for compact, semantically coherent documents; reduces unnecessary splits.
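As an illustration of the mechanism (not LlamaIndex's implementation), a toy semantic splitter can be sketched with a bag-of-words stand-in for the embedding model. The sentences, the threshold value, and the helper names are all invented for the example: groups of buffer_size sentences are embedded, and a new chunk starts wherever the similarity between adjacent groups drops below the threshold.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real splitter uses a neural model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences, buffer_size=1, threshold=0.2):
    # Compare adjacent groups of buffer_size sentences
    groups = [sentences[i:i + buffer_size] for i in range(0, len(sentences), buffer_size)]
    vectors = [embed(" ".join(g)) for g in groups]
    chunks, current = [], list(groups[0])
    for prev, nxt, group in zip(vectors, vectors[1:], groups[1:]):
        if cosine(prev, nxt) < threshold:  # semantic break detected
            chunks.append(" ".join(current))
            current = []
        current.extend(group)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Cats are small pets.",
    "Cats like to sleep.",
    "Stock markets fell sharply today.",
    "Stock markets react to rates.",
]
chunks = semantic_split(sentences)
print(chunks)  # two chunks: one about cats, one about markets
```

A larger buffer_size smooths the similarity signal (fewer, bigger chunks); a smaller one reacts to every local topic shift, mirroring the parameter-setting suggestions above.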
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local embedding model used to measure semantic similarity
embed_model = HuggingFaceEmbedding(model_name=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3")

documents = SimpleDirectoryReader(
    input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]
).load_data()

parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    buffer_size=2,
)
nodes = parser.get_nodes_from_documents(documents)

print(f"A total of {len(nodes)} semantic chunks were generated")
print(f"\nSample chunk:\n{nodes[0].get_content()}")
```
6. Summary and suggestions