RAG implementation: a complete analysis of 4 strategies for text segmentation

Written by
Caleb Hayes
Updated on: June 29, 2025
Recommendation

The application of RAG technology in text processing improves the efficiency of retrieval and generation.

Core content:
1. The RAG pipeline and the importance of text segmentation
2. The three elements of a text segmentation strategy and a comparison of different strategies
3. Recommended segmentation strategy combinations and a detailed introduction to the splitters


1. What is RAG

RAG is an AI technology that combines retrieval and generation. Its pipeline usually includes the following stages (a minimal code sketch follows the list):

  • Text splitting

  • Vector encoding (embedding)

  • Storing in a vector database (such as FAISS / Chroma / Milvus)

  • Retrieving similar passages

  • Generating answers
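
To make these stages concrete, here is a minimal sketch using LlamaIndex. The "data" folder, chunk sizes, and the question are placeholders, and it assumes an embedding model and LLM are already configured (for example via environment variables or Settings):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# 1) Load and split the documents into chunks
documents = SimpleDirectoryReader("data").load_data()
nodes = SentenceSplitter(chunk_size=500, chunk_overlap=50).get_nodes_from_documents(documents)

# 2) Embed the chunks and store them in a vector index
index = VectorStoreIndex(nodes)

# 3) Retrieve similar passages and generate an answer
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What is RAG?"))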


2. Why do we need to segment text?

Large models (such as GPT) cannot retrieve an entire document directly. We must first divide the document into chunks of appropriate size and then embed each chunk. If we split too finely, we lose context; if we split too coarsely, the embeddings become imprecise or the chunks exceed the context window.

Text segmentation is very important in RAG, because improper segmentation may:

  • ✂️ Interrupt semantic units (e.g. cut a sentence in half)

  • Reduce recall accuracy

  • Leave the LLM with an incomplete understanding, producing poor output


The goals of text segmentation are:

  • Divide large documents into several small chunks to make embedding and subsequent retrieval easier.

  • Segment reasonably so that retrieval becomes more accurate.


3. Text segmentation strategies

Three elements of chunking:

Element | Description | Recommended value
--- | --- | ---
Chunk size | The length of each chunk | 200-500 words
Chunk overlap | Duplicate content shared by adjacent chunks | 10%-20%
Segmentation basis | Split by sentence / paragraph / semantics | Semantic segmentation is best


Chunking strategy comparison:

Strategy type | Advantages | Drawbacks | Applicable scenarios
--- | --- | --- | ---
Fixed size | Simple to implement | May cut across complete semantic units | Technical documentation
Split by paragraph | Maintains logical integrity | Paragraph lengths vary widely | Literary fiction
Semantic segmentation | Ensures content integrity | Heavy computational cost | Domain-specific professional documents
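
As a reference point for the "Fixed size" row above, fixed-size chunking with overlap can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not the splitter LlamaIndex actually uses:

def fixed_size_chunks(text: str, chunk_size: int = 300, overlap: int = 50):
    """Naive fixed-size chunking by character count, with overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "RAG combines retrieval and generation. " * 40
for chunk in fixed_size_chunks(sample)[:3]:
    # Sentences can be cut in half at chunk boundaries -- the drawback noted above
    print(repr(chunk[:60]), "...")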


LlamaIndex provides multiple built-in TextSplitter classes to handle documents of different languages and structures.

Commonly used splitters:

TextSplitter type | Applicable situations | Chinese support
--- | --- | ---
SentenceSplitter | Sentence-based segmentation (suited to natural language) | ✅ Very suitable for Chinese
TokenTextSplitter | Chunks by token count | ✅ Precise control over LLM input
SentenceWindowNodeParser | Sentence-window method (overlapping segments) | ✅ Suitable for documents with continuous context
SemanticSplitterNodeParser | Semantic segmentation (uses a small model to judge boundaries) | ✅ Advanced but slightly slower

4. Recommended segmentation strategy combinations

  • Simple documents: use SentenceSplitter with a fixed chunk size

  • High context requirements: use SentenceWindowNodeParser so that sentence windows keep the context continuous

  • High-quality QA: use SemanticSplitterNodeParser for semantic segmentation (requires installing an additional small embedding model), as sketched below
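
One way to read these recommendations in code. The scenario names and the helper function below are illustrative only, not a LlamaIndex API:

from llama_index.core.node_parser import (
    SentenceSplitter,
    SentenceWindowNodeParser,
    SemanticSplitterNodeParser,
)

def make_parser(scenario: str, embed_model=None):
    """Hypothetical helper mapping the three scenarios above to a parser."""
    if scenario == "simple":       # simple documents
        return SentenceSplitter(chunk_size=512, chunk_overlap=50)
    if scenario == "context":      # high context requirements
        return SentenceWindowNodeParser.from_defaults(window_size=3)
    if scenario == "semantic":     # high-quality QA (needs an embedding model)
        return SemanticSplitterNodeParser(embed_model=embed_model)
    raise ValueError(f"unknown scenario: {scenario}")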


5. Detailed introduction to the splitters

1. Sentence Splitter

Prioritize complete sentences when parsing text; this class attempts to keep sentences and paragraphs together.

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
chunk_size | int | The token size of each chunk | 1024
chunk_overlap | int | The token overlap between chunks when splitting | 200
separator | str | The default word separator | ' ' (space)
paragraph_separator | str | The separator between paragraphs | '\n\n\n'
secondary_chunking_regex | str or None | Backup regular expression for splitting sentences | '[^,.;。?!]+[,.;。?!]?'

The separator is very important; common sentence-ending punctuation differs by language:

  • Chinese breaks at "。!?\n"

  • English breaks at ".!?\n"

  • Spanish breaks at "¡¿"

 Install Dependencies

pip install llama-index llama-index-embeddings-huggingface

 split_text_demo.py

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# 1. Load documents
documents = SimpleDirectoryReader(input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]).load_data()
# print(f"Original number of documents: {len(documents)}")

# 2. Create the SentenceSplitter
sentence_splitter = SentenceSplitter(
    chunk_size=100,
    chunk_overlap=10,
    separator="。!!?.\n¡¿",  # separators suitable for Chinese, English and Spanish
)

# 3. Split the documents
nodes = sentence_splitter.get_nodes_from_documents(documents)
print(f"Number of generated nodes: {len(nodes)}")
print("Lengths of the first 3 chunks:", [len(n.text) for n in nodes[:3]])
print("\nSegmentation result example:")
for i, node in enumerate(nodes[:3]):
    print(f"\nChunk {i + 1}:\n{node}")

About chunk_overlap

chunk_overlap is the number of characters or tokens that overlap between two adjacent text chunks.

What it does:

  • Preserves contextual continuity (avoids fragmenting important information)

  • Improves retrieval and generation quality (e.g. for multi-turn question answering)

Sometimes, even when the value is set, adjacent chunks do not overlap, because sentence boundaries take priority during segmentation (see the sketch below):

  • The text is not strictly divided into fixed-length chunks by character count

  • So chunk_overlap means "overlap sentences as much as possible" rather than precise control of character overlap
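
A small sketch of this behaviour; the text and the (deliberately tiny) sizes are made up for illustration:

from llama_index.core.node_parser import SentenceSplitter

text = (
    "The model reads documents. It splits them into chunks. "
    "Each chunk is embedded separately. Retrieval then finds relevant chunks."
)
splitter = SentenceSplitter(chunk_size=12, chunk_overlap=6, separator=" ")

# split_text returns plain strings; adjacent chunks may repeat a whole sentence,
# or not overlap at all, because sentence boundaries take priority
for i, chunk in enumerate(splitter.split_text(text)):
    print(i, repr(chunk))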


2. Fixed block segmentation

TokenTextSplitter splits text by a fixed number of tokens; it is currently used in relatively few scenarios.

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
chunk_size | int | The token size of each chunk | 1024
chunk_overlap | int | The token overlap between chunks when splitting | 20
separator | str | The default word separator | ' '
backup_separators | List | Additional separators to use for splitting | <dynamic>
keep_whitespaces | bool | Whether to preserve leading/trailing whitespace in chunks | False

token_text_splitter_demo.py

# Use fixed-size (token-based) segmentation
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader(input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]).load_data()

fixed_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
fixed_nodes = fixed_splitter.get_nodes_from_documents(documents)
print("Fixed chunk lengths:", [len(n.text) for n in fixed_nodes[:3]])
print("First node content:\n", fixed_nodes[0].text)
print("===========")
print("Second node content:\n", fixed_nodes[1].text)

3. Sentence Window Segmentation

SentenceWindowNodeParser is an advanced text parser provided by LlamaIndex, designed for RAG (Retrieval-Augmented Generation) scenarios. Its core function is to split a document into sentences and attach to each sentence node the contextual information of several sentences before and after it, thereby providing richer semantic background at the retrieval and generation stages.

Core Features

  • Sentence-level segmentation: Split the document into independent sentences, with each sentence as a node.

  • Context window: Attach the contents of several sentences before and after each sentence node to form a "sentence window" to provide context information.

  • Metadata storage: The context window information is stored in the node's metadata for easy use in subsequent retrieval and generation stages.

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
sentence_splitter | Optional[Callable] | Splits text into sentences | <function split_by_sentence_tokenizer.<locals>.<lambda>>
include_metadata | bool | Whether to include metadata in the node | Required
include_prev_next_rel | bool | Whether to include previous/next relationships | Required
window_size | int | The number of sentences before and after each sentence node | 3
window_metadata_key | str | The metadata key for storing the context window | 'window'
original_text_metadata_key | str | The metadata key for storing the original sentence | 'original_text'


from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser

text = "hello. how are you? I am fine! aaa;ee. bb,cc"

# Define the sentence-window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
# print(node_parser)
# documents = SimpleDirectoryReader(input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]).load_data()

nodes = node_parser.get_nodes_from_documents([Document(text=text)])
print([x.text for x in nodes])
print("-" * 20)
print("Metadata of the first node:", nodes[0].metadata)
print("-" * 20)
print("Metadata of the last node:", nodes[4].metadata)

Result analysis:

  • The English string "hello. how are you? I am fine! aaa;ee. bb,cc" is split into five sentences. SentenceWindowNodeParser recognizes sentence boundaries by the punctuation at the end of each sentence: the period (.), question mark (?) and exclamation mark (!).

  • When the document is split, both the window data and the original sentence are stored in the node's metadata, under the custom window_metadata_key and original_text_metadata_key.

  • Looking at the metadata of the first node: it is the first sentence of the document, so its window contains only the current sentence and the sentences that follow it (there are no preceding sentences).

  • The last node is the final sentence, so its window contains only the three sentences before it plus the current sentence, four sentences in total. A sketch of how this window metadata is used at query time follows.
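
For reference, the usual way this window metadata is consumed is with LlamaIndex's MetadataReplacementPostProcessor, which swaps each retrieved sentence for the window stored in its metadata before the LLM sees it. A sketch, reusing the nodes from the example above and assuming an embedding model and LLM are configured:

from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

index = VectorStoreIndex(nodes)  # embeds the sentence nodes

# At query time, replace each retrieved sentence with its stored "window"
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("How are you?"))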


Note:

SentenceWindowNodeParser in LlamaIndex only recognizes half-width (English) punctuation marks, so it cannot segment Chinese documents on its own.

Solution: replace the full-width punctuation marks in the Chinese document (。?!) with the corresponding half-width marks and add a space after each; the Chinese text can then be segmented.

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# A string that uses full-width sentence-ending punctuation (。!?), as a Chinese document would
text = "Hello, nice to meet you。It's already 10 o'clock, but I don't want to get up!It's snowing!Have you finished your homework?"
# Replace the full-width punctuation with half-width marks plus a trailing space
text = text.replace('。', '. ')
text = text.replace('!', '! ')
text = text.replace('?', '? ')

# Define the sentence-window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

nodes = node_parser.get_nodes_from_documents([Document(text=text)])
print([x.text for x in nodes])
print("-" * 20)
print("Metadata of the first node:", nodes[0].metadata)
print("-" * 20)
print("Metadata of the second-to-last node:", nodes[2].metadata)

4. Semantic splitter

Parameter description:

Name | Type | Description | Default
--- | --- | --- | ---
buffer_size | int | The number of sentences the model considers at a time | 1
embed_model | BaseEmbedding | The embedding model to use | Required
sentence_splitter | Optional[Callable] | Splits text into sentences | <function split_by_sentence_tokenizer.<locals>.<lambda>>
include_metadata | bool | Whether to include metadata in the node | Required
include_prev_next_rel | bool | Whether to include previous/next relationships | Required
breakpoint_percentile_threshold | int | The percentile of cosine difference between one group of sentences and the next that must be exceeded to form a new node; the smaller this number, the more nodes are generated | 95

The role of buffer_size:

The main function of SemanticSplitterNodeParser is to divide the document into multiple nodes according to semantic similarity, and each node contains a set of semantically related sentences. In this process, buffer_size determines the number of sentences that the model considers each time when calculating semantic similarity .

For example, setting buffer_size=3 means that the model will evaluate the semantic similarity of 3 consecutive sentences as a unit each time. This helps determine whether breaks should be inserted between these sentences to form new nodes.

Parameter setting suggestions

  • Smaller buffer_size (such as 1 or 2): Suitable for documents with frequent content changes or loose structures, which helps to capture semantic changes more finely.

  • Larger buffer_size (such as 5 or 10): suitable for compact and semantically coherent documents, helping to reduce unnecessary segmentation.

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load a local embedding model and the documents
embed_model = HuggingFaceEmbedding(model_name=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3")
documents = SimpleDirectoryReader(input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]).load_data()

parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    buffer_size=2,
)
nodes = parser.get_nodes_from_documents(documents)
print(f"A total of {len(nodes)} semantic chunks were generated")
print(f"\nSample chunk:\n{nodes[0].get_content()}")
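
If the default split is too coarse or too fine, the breakpoint_percentile_threshold parameter (see the table above) can be tuned; a lower threshold produces more, smaller chunks. A sketch reusing embed_model and documents from the example above:

# Lower threshold => more breakpoints => more, smaller semantic chunks
finer_parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    buffer_size=1,
    breakpoint_percentile_threshold=85,
)
finer_nodes = finer_parser.get_nodes_from_documents(documents)
print(f"{len(finer_nodes)} chunks with a lower breakpoint threshold")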

6. Summary and suggestions

Goal | Recommended TextSplitter
--- | ---
Compatible with Chinese and English, fast and easy to use | SentenceSplitter ✅
Contextual continuity required | SentenceWindowNodeParser ✅
High-quality question answering + multi-semantic fusion | SemanticSplitterNodeParser (advanced)