RAG implementation: a complete analysis of 4 strategies for text segmentation

Applying RAG to text processing improves both retrieval efficiency and generation quality.
Core content:
1. The RAG pipeline and why text segmentation matters
2. The three elements of a segmentation strategy, with a comparison of strategies
3. Recommended strategy combinations and a detailed introduction to each splitter
1. What is RAG
RAG (Retrieval-Augmented Generation) is an AI technique that combines retrieval with generation. Its pipeline usually includes the following stages:
1. Text splitting
2. Vectorized encoding (embedding)
3. Storage in a vector database (such as FAISS, Chroma, or Milvus)
4. Retrieval of similar passages
5. Answer generation
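The five stages above can be sketched end to end in plain Python. This is a toy illustration only: the bag-of-words "embedding" and the in-memory list standing in for a vector store are invented stand-ins for a real embedding model and FAISS/Chroma/Milvus.

```python
import math
from collections import Counter

# Stage 1: naive fixed-size splitting
def split_text(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Stage 2: toy bag-of-words "embedding" (a real system uses a neural model)
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ("RAG combines retrieval and generation. "
        "Chunks are embedded and stored. "
        "Similar passages are retrieved at query time.")

# Stage 3: "store" each chunk with its vector (in place of FAISS/Chroma/Milvus)
store = [(chunk, embed(chunk)) for chunk in split_text(docs, chunk_size=40)]

# Stage 4: retrieve the chunk most similar to the query
query_vec = embed("how are passages retrieved")
best_chunk, _ = max(store, key=lambda pair: cosine(query_vec, pair[1]))
print(best_chunk)  # Stage 5 would pass this context to an LLM to generate the answer
```

Every real RAG stack swaps each toy stage for a production component, but the data flow stays exactly this shape.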
2. Why do we need to segment text?
Large models (such as GPT) cannot retrieve an entire document directly. We must first divide the document into chunks of appropriate size and then embed each chunk. If the chunks are too fine, context is lost; if they are too coarse, the embeddings become imprecise or the chunk exceeds the context window.
Text segmentation is critical in RAG, because improper segmentation may:
- Break semantic units (e.g. cut a sentence in half)
- Reduce recall accuracy
- Leave the LLM with incomplete context, degrading its output
The goals of text segmentation are:
- Divide large documents into small chunks so they can be embedded and retrieved efficiently.
- Segment at sensible boundaries so that retrieval is more accurate.
3. Text segmentation strategy
The three elements of chunking are chunk size (chunk_size), chunk overlap (chunk_overlap), and the separator (separator); these three parameters recur in every splitter below.
Comparison of chunking strategies:

| strategy | how it splits | typical use |
|---|---|---|
| SentenceSplitter | complete sentences, up to a fixed chunk size | simple documents |
| TokenTextSplitter | a fixed number of tokens | few scenarios today |
| SentenceWindowNodeParser | one sentence per node, plus a context window | high context requirements |
| SemanticSplitterNodeParser | semantic similarity between sentences | high-quality QA |

LlamaIndex provides multiple built-in TextSplitter classes to handle documents of different languages and structures. The commonly used splitters are SentenceSplitter, TokenTextSplitter, SentenceWindowNodeParser, and SemanticSplitterNodeParser.
4. Recommended segmentation strategy combinations
- Simple documents: SentenceSplitter with a fixed chunk size
- High context requirements: SentenceWindowNodeParser, whose sentence window keeps the context continuous
- High-quality QA: SemanticSplitterNodeParser for semantic segmentation (requires installing an additional small embedding model)
5. Detailed introduction to the splitters
1. Sentence Splitter
When parsing text, it prioritizes complete sentences: this class tries to keep sentences and paragraphs together.
Parameters:
| name | type | description | default |
|---|---|---|---|
| chunk_size | int | The token size of each chunk. | 1024 |
| chunk_overlap | int | The number of tokens that adjacent chunks overlap. | 200 |
| separator | str | The default word separator. | ' ' |
| paragraph_separator | str | The separator between paragraphs. | '\n\n\n' |
| secondary_chunking_regex | str | Backup regular expression for splitting sentences. | '[^,.;。？！]+[,.;。？！]?' |

separator is very important: set it to include the common sentence-ending punctuation of every language that appears in your documents.
Install dependencies:

```shell
pip install llama-index llama-index-embeddings-huggingface
```
split_text_demo.py:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# 1. Load documents
documents = SimpleDirectoryReader(
    input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]
).load_data()
# print(f"Original number of documents: {len(documents)}")

# 2. Create the SentenceSplitter
sentence_splitter = SentenceSplitter(
    chunk_size=100,
    chunk_overlap=10,
    separator="。！？.\n¡¿",  # separators covering Chinese, English, and Spanish
)

# 3. Split the documents
nodes = sentence_splitter.get_nodes_from_documents(documents)
print(f"Number of generated nodes: {len(nodes)}")
print("Lengths of the first 3 chunks:", [len(n.text) for n in nodes[:3]])
print("\nSegmentation result example:")
for i, node in enumerate(nodes[:3]):
    print(f"\nChunk {i + 1}:\n{node}")
```
About chunk_overlap
chunk_overlap is the number of characters or tokens that two adjacent chunks share.
What it does:
- Preserves contextual continuity (avoids fragmenting important information)
- Improves retrieval and generation quality (e.g. in multi-turn question answering)
Sometimes adjacent chunks do not overlap even when the value is set, because sentence boundaries take priority during splitting: the text is not cut into strictly fixed-length blocks by character count. So chunk_overlap means "overlap whole sentences as much as possible" rather than a precise character-level overlap.
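Sentence-boundary behavior aside, the basic mechanics of overlap are easy to see with a character-level toy splitter. This is pure Python, not LlamaIndex: a window of chunk_size characters advances by chunk_size − chunk_overlap each step, so adjacent chunks share exactly chunk_overlap characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a chunk_size-character window, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Here each chunk repeats the last 3 characters of its predecessor; SentenceSplitter applies the same idea at the token level, but snaps the cut points to sentence boundaries.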
2. Fixed-size segmentation
TokenTextSplitter splits by a fixed number of tokens; it is needed in relatively few scenarios today.
Parameters:

| name | type | description | default |
|---|---|---|---|
| chunk_size | int | The token size of each chunk. | 1024 |
| chunk_overlap | int | The number of tokens that adjacent chunks overlap. | 20 |
| separator | str | The default word separator. | ' ' |
| backup_separators | List | Additional separators to use for splitting. | <dynamic> |
| keep_whitespaces | bool | Whether to preserve leading/trailing whitespace in chunks. | False |
token_text_splitter_demo.py:

```python
# Fixed-size (token-based) segmentation
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader(
    input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]
).load_data()

fixed_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
fixed_nodes = fixed_splitter.get_nodes_from_documents(documents)

print("Fixed-chunk lengths:", [len(n.text) for n in fixed_nodes[:3]])
print("First node content:\n", fixed_nodes[0].text)
print("===========")
print("Second node content:\n", fixed_nodes[1].text)
```
3. Sentence Window Segmentation
SentenceWindowNodeParser is an advanced text parser provided by LlamaIndex, designed for RAG (Retrieval-Augmented Generation) scenarios. Its core function is to split the document into sentences and attach contextual information of several sentences before and after each sentence node , thereby providing richer semantic background in the retrieval and generation stages.
Core features:
- Sentence-level segmentation: the document is split into independent sentences, each becoming a node.
- Context window: the contents of several sentences before and after each sentence are attached to the node, forming a "sentence window" that provides context.
- Metadata storage: the window text is stored in the node's metadata for use in the subsequent retrieval and generation stages.
Parameters:

| name | type | description | default |
|---|---|---|---|
| sentence_splitter | Optional[Callable] | Function used to split text into sentences. | split_by_sentence_tokenizer |
| include_metadata | bool | Whether to include metadata in the node. | True |
| include_prev_next_rel | bool | Whether to include previous/next node relationships. | True |
| window_size | int | The number of sentences captured on each side of the current sentence. | 3 |
| window_metadata_key | str | The metadata key used to store the window. | 'window' |
| original_text_metadata_key | str | The metadata key used to store the original sentence. | 'original_text' |
```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

text = "hello. how are you? I am fine! aaa;ee. bb,cc"

# Define the sentence-window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

nodes = node_parser.get_nodes_from_documents([Document(text=text)])
print([x.text for x in nodes])
print("-" * 20)
print("Metadata of the first node:", nodes[0].metadata)
print("-" * 20)
print("Metadata of the last node:", nodes[-1].metadata)
```
Result analysis:
The English string "hello. how are you? I am fine! aaa;ee. bb,cc" is split into five sentences: SentenceWindowNodeParser recognizes sentence boundaries by the sentence-ending punctuation marks period (.), question mark (?), and exclamation mark (!).
When the document is split, both the window text and the original sentence are stored in each node's metadata, under the custom window_metadata_key and original_text_metadata_key.
Looking at the metadata of the first node: since it holds the first sentence of the document, its window contains only the current sentence plus the following sentences that fall inside the window.
The last node, holding the final sentence, has a window containing only the preceding sentences inside the window plus the current sentence.
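The window construction described above can be approximated in a few lines of plain Python. This is an illustration of the idea, not LlamaIndex's actual implementation; the naive punctuation-based sentence splitter and the assumption that the window spans up to window_size sentences on each side are both simplifications for the example.

```python
import re

def build_sentence_windows(text: str, window_size: int = 3) -> list[dict]:
    # Naive boundary detection on . ? !  (real parsers use a sentence tokenizer)
    sentences = [s.strip() for s in re.findall(r"[^.?!]+[.?!]?", text) if s.strip()]
    nodes = []
    for i, sent in enumerate(sentences):
        # Up to window_size sentences on each side of the current one
        window = sentences[max(0, i - window_size): i + window_size + 1]
        nodes.append({"original_text": sent, "window": " ".join(window)})
    return nodes

nodes = build_sentence_windows("hello. how are you? I am fine! aaa;ee. bb,cc")
print(len(nodes))          # one node per sentence
print(nodes[0]["window"])  # truncated at the start of the document
print(nodes[-1]["window"]) # truncated at the end of the document
```

Nodes near the document boundaries simply get smaller windows, which matches the asymmetric metadata observed for the first and last nodes above.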
Note:
The SentenceWindowNodeParser in LlamaIndex only recognizes half-width (ASCII) sentence-ending punctuation, so it cannot segment Chinese documents, which use full-width punctuation.
Workaround: replace the full-width punctuation marks (。！？) in the Chinese document with the corresponding half-width marks, each followed by an extra space; the Chinese text can then be segmented.
```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# Text using full-width sentence-ending punctuation, as in a Chinese document
text = "Hello, nice to meet you。It's already 10 o'clock, but I don't want to get up！It's snowing！Have you finished your homework？"

# Replace each full-width mark with its half-width equivalent plus a space
text = text.replace("。", ". ")
text = text.replace("！", "! ")
text = text.replace("？", "? ")

# Define the sentence-window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

nodes = node_parser.get_nodes_from_documents([Document(text=text)])
print([x.text for x in nodes])
print("-" * 20)
print("Metadata of the first node:", nodes[0].metadata)
print("-" * 20)
print("Metadata of the second-to-last node:", nodes[2].metadata)
```
4. Semantic splitter
Parameter description
The role of buffer_size:
SemanticSplitterNodeParser divides a document into nodes according to semantic similarity, so that each node contains a group of semantically related sentences. In this process, buffer_size determines how many sentences the model considers together each time it computes semantic similarity.
For example, buffer_size=3 means the model evaluates the semantic similarity of 3 consecutive sentences as one unit, which helps decide whether a breakpoint should be inserted between these sentences to start a new node.
Parameter-setting suggestions:
- Smaller buffer_size (such as 1 or 2): suitable for documents with frequent topic changes or a loose structure; captures semantic shifts more finely.
- Larger buffer_size (such as 5 or 10): suitable for compact, semantically coherent documents; reduces unnecessary splits.
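As an illustration of the mechanism (not LlamaIndex's implementation), a toy semantic splitter can be sketched with a bag-of-words stand-in for the embedding model. The sentences, the threshold value, and the helper names are all invented for the example: groups of buffer_size sentences are embedded, and a new chunk starts wherever the similarity between adjacent groups drops below the threshold.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real splitter uses a neural model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences, buffer_size=1, threshold=0.2):
    # Compare adjacent groups of buffer_size sentences
    groups = [sentences[i:i + buffer_size] for i in range(0, len(sentences), buffer_size)]
    vectors = [embed(" ".join(g)) for g in groups]
    chunks, current = [], list(groups[0])
    for prev, nxt, group in zip(vectors, vectors[1:], groups[1:]):
        if cosine(prev, nxt) < threshold:  # semantic break detected
            chunks.append(" ".join(current))
            current = []
        current.extend(group)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Cats are small pets.",
    "Cats like to sleep.",
    "Stock markets fell sharply today.",
    "Stock markets react to rates.",
]
chunks = semantic_split(sentences)
print(chunks)  # two chunks: one about cats, one about markets
```

A larger buffer_size smooths the similarity signal (fewer, bigger chunks); a smaller one reacts to every local topic shift, mirroring the parameter-setting suggestions above.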
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local embedding model used to measure semantic similarity
embed_model = HuggingFaceEmbedding(model_name=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3")

documents = SimpleDirectoryReader(
    input_files=[r"D:\Test\LLMTrain\day20_llamaindex\data\ai.txt"]
).load_data()

parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    buffer_size=2,
)
nodes = parser.get_nodes_from_documents(documents)

print(f"A total of {len(nodes)} semantic chunks were generated")
print(f"\nSample chunk:\n{nodes[0].get_content()}")
```
6. Summary and suggestions