Embedding-Based Segmentation: Text Splitting, an Important Part of RAG

Written by
Audrey Miles
Updated on: June 29, 2025

SemanticChunker: A new breakthrough in text segmentation, optimizing the efficiency and accuracy of the RAG model.

Core content:
1. The main application scenarios and advantages of SemanticChunker
2. Core principle: Text segmentation strategy based on semantic coherence
3. Practical application: parameter settings and usage examples


SemanticChunker is a text chunking strategy that aims to maintain semantic coherence. It understands text semantics through an embedding model and can better maintain paragraph integrity compared to traditional character segmentation methods. This article introduces its principles, usage scenarios, advantages and disadvantages, and application suggestions in detail to help developers choose the most appropriate text chunking strategy and improve the efficiency and accuracy of RAG applications.

  • 1. The main purpose of SemanticChunker
  • 2. Core Principles
  • 3. Flowchart
  • 4. Use of SemanticChunker
    • 1. Detailed explanation of initialization parameters
    • 2. Comparison of breakpoint threshold types
    • 3. Complete usage example
  • 5. Advantages and disadvantages analysis
  • Summary


Among the core steps of RAG, there is one crucial step: "Text Splitting".

Its main function is to divide a large text into smaller and more reasonable segments so that the model can better understand, process or store the content.

If an entire article is embedded without being split, the embedding granularity is too coarse and the model is more likely to answer incorrectly. How well the text is split therefore directly affects the relevance and accuracy of the final answer.

The LangChain framework provides a variety of text splitters. Among them, RecursiveCharacterTextSplitter is widely used for its recursive splitting strategy based on characters and delimiters, which is fast and consumes few resources.

However, this purely structure-based chunking approach may sometimes cut off semantically complete paragraphs.

To solve this problem, you can use SemanticChunker, which uses an embedding model for semantic understanding and intelligently draws boundaries based on the semantic relevance of the content, thereby better preserving the semantic integrity of the text.

1. The main purpose of SemanticChunker

SemanticChunker is a text splitter based on semantic similarity. Its core strength is that it can intelligently identify semantic boundaries in text. It is mainly suited to the following scenarios:

  • Maintaining semantic coherence: splitting a long text into segments without disrupting the natural flow of the content, so that segments are neither overly fragmented nor stripped of their context.
  • Optimizing RAG retrieval: as a preprocessing step before embedding or vector retrieval, generating semantically complete chunks improves the accuracy of downstream retrieval and generation tasks.
  • Detecting topic changes: automatically finding semantic discontinuities (semantic "breakpoints") in text, such as the start of a new topic or a significant shift in context.
  • Going beyond simple cutting: compared with traditional methods based on a fixed character count or separators, it determines split points dynamically from semantic "jumps" in the content, producing segmentation closer to how humans read.

2. Core Principles

  1. Split the text into sentences with a regular expression (e.g., on the sentence-ending punctuation . ? !);
  2. Combine each sentence with its neighboring sentences (controlled by the buffer_size parameter);
  3. Embed each sentence group (using the supplied embeddings model);
  4. Compute the semantic difference between adjacent groups (specifically, the cosine distance);
  5. Split where the semantic difference "jumps": a particularly large difference suggests the topic has changed or the content has shifted, so a break is placed there;
  6. The threshold for such a "jump" can be computed in several ways:
  • percentile
  • mean + standard deviation (standard_deviation)
  • mean + interquartile range (interquartile)
  • gradient percentile (gradient)
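The steps above can be sketched in plain Python. Everything here is illustrative, not LangChain's actual implementation: the tiny hand-made embedding vectors stand in for a real embedding model, and the interpolated percentile helper approximates the default percentile strategy.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def percentile(values, pct):
    """Linearly interpolated percentile of a list of numbers."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100
    f, c = math.floor(k), math.ceil(k)
    return s[f] if f == c else s[f] + (s[c] - s[f]) * (k - f)

def semantic_split(sentences, embed, pct=95):
    """Break the sentence list wherever the embedding distance 'jumps'."""
    vecs = [embed(s) for s in sentences]
    distances = [cosine_distance(vecs[i], vecs[i + 1])
                 for i in range(len(vecs) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d > threshold:            # big jump -> likely topic change
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Toy embeddings: two "programming" sentences, then two "weather" sentences
toy_vecs = {
    "Python is popular.":  [1.0, 0.0],
    "Java is also used.":  [1.0, 0.1],
    "Clouds bring rain.":  [0.0, 1.0],
    "Storms follow wind.": [0.1, 1.0],
}
chunks = semantic_split(list(toy_vecs), toy_vecs.get)
print(chunks)  # the break lands exactly between the two topics
```

The distance between the second and third sentences is far above the 95th-percentile threshold, so the text splits into two topic-aligned chunks.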

Related reading:

What is Embedding in Artificial Intelligence?

From a beginner to an expert in artificial intelligence: A simple understanding of cosine similarity

3. Flowchart



4. Use of SemanticChunker

1. Detailed explanation of initialization parameters

semantic_splitter = SemanticChunker(
    embeddings,                                # Required: embedding model instance
    buffer_size=1,                             # Optional: context window size when combining sentences
    add_start_index=False,                     # Optional: whether to add the start index to the metadata
    breakpoint_threshold_type="percentile",    # Breakpoint threshold type
    breakpoint_threshold_amount=None,          # Breakpoint threshold value
    number_of_chunks=None,                     # Optional: desired number of chunks
    sentence_split_regex=r"(?<=[.?!])\s+",     # Sentence-splitting regular expression
    min_chunk_size=None                        # Optional: minimum chunk size
)

Note: the default sentence-splitting regex handles Chinese text poorly (Chinese sentences usually have no space after the punctuation), so it needs to be customized, for example: sentence_split_regex=r"(?<=[.?!])\s*" (for Chinese punctuation, also add 。！？ to the character class).
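A quick re.split check shows the difference (the Chinese pattern here is an illustrative customization, not a LangChain default; note that \s* allows zero-width matches, which produce empty strings that should be filtered out):

```python
import re

# Default pattern: requires at least one whitespace after the punctuation
english = "Hello world. How are you? Great!"
print(re.split(r"(?<=[.?!])\s+", english))
# -> ['Hello world.', 'How are you?', 'Great!']

# Chinese sentences end with full-width punctuation and no space, so
# include those characters and allow zero trailing whitespace (\s*).
chinese = "你好。今天天气不错！要出门吗？"
parts = [s for s in re.split(r"(?<=[。！？.?!])\s*", chinese) if s]
print(parts)
# -> ['你好。', '今天天气不错！', '要出门吗？']
```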

2. Comparison of breakpoint threshold types

| Type | Calculation method | Applicable scenarios | Default value |
|---|---|---|---|
| percentile | percentile of distances (default 95th) | general scenarios | 95 |
| standard_deviation | mean + N × standard deviation | normally distributed data | 3 |
| interquartile | mean + N × interquartile range | data with outliers | 1.5 |
| gradient | percentile of the rate of change of distances | focusing on abrupt semantic shifts | 95 |
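The four strategies in the table can be imitated with a few lines of standard-library Python. This is an illustrative sketch only: the real implementation relies on NumPy, whose percentile and gradient conventions may differ slightly from the helpers below.

```python
import math
import statistics

def pctl(values, pct):
    """Linearly interpolated percentile of a list of numbers."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100
    f, c = math.floor(k), math.ceil(k)
    return s[f] if f == c else s[f] + (s[c] - s[f]) * (k - f)

def breakpoint_threshold(distances, kind="percentile", amount=None):
    """Compute a split threshold from neighbor distances, per strategy."""
    if kind == "percentile":          # default amount: 95
        return pctl(distances, 95 if amount is None else amount)
    if kind == "standard_deviation":  # default amount: 3
        n = 3 if amount is None else amount
        return statistics.mean(distances) + n * statistics.pstdev(distances)
    if kind == "interquartile":       # default amount: 1.5
        n = 1.5 if amount is None else amount
        q1, q3 = pctl(distances, 25), pctl(distances, 75)
        return statistics.mean(distances) + n * (q3 - q1)
    if kind == "gradient":            # percentile of successive deltas
        grads = [b - a for a, b in zip(distances, distances[1:])]
        return pctl(grads, 95 if amount is None else amount)
    raise ValueError(kind)

d = [0.1, 0.2, 0.15, 0.9, 0.1]  # toy cosine distances between neighbors
for kind in ("percentile", "standard_deviation", "interquartile", "gradient"):
    print(kind, round(breakpoint_threshold(d, kind), 4))
```

Any distance above the returned threshold becomes a breakpoint; the strategies differ only in how tolerant the threshold is to outliers and to the overall spread of the distances.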

3. Complete usage example

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Create an OpenAI embeddings instance
embeddings = OpenAIEmbeddings(openai_api_key="hk-iwtbie91e427",
                              model="text-embedding-3-large",
                              base_url="https://api.openai-hk.com/v1")

# Read the sample text
with open('example_text_2.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Create the two splitters
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    separators=["\n\n", "\n", ".", ",", " ", ""]
)

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    sentence_split_regex=r"(?<=[.?!])\s*",
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

# Split the text with the recursive splitter
print("The result of RecursiveCharacterTextSplitter:")
recursive_chunks = recursive_splitter.split_text(text)
for i, chunk in enumerate(recursive_chunks, 1):
    print(f"\nBlock {i}:")
    print(chunk)
    print("-" * 80)

# Split the text with the semantic splitter
print("\nSemanticChunker's block results:")
semantic_chunks = semantic_splitter.split_text(text)
for i, chunk in enumerate(semantic_chunks, 1):
    print(f"\nBlock {i}:")
    print(chunk)
    print("-" * 80)

# Compare the number of chunks produced
print("\nComparison of block numbers:")
print(f"RecursiveCharacterTextSplitter: {len(recursive_chunks)} chunks")
print(f"SemanticChunker: {len(semantic_chunks)} chunks")

Block results:

The block result of RecursiveCharacterTextSplitter:

Block 1:
Three dimensions of technological development

The evolution of programming languages
Starting from the earliest machine language, programming languages ​​have gone through the development stages of assembly language and high-level language. In the 1950s, the emergence of early high-level languages ​​such as FORTRAN greatly improved programming efficiency. Subsequently, the object-oriented programming paradigm emerged, and languages ​​such as C++ and Java led the new direction of software development. In recent years, the popularity of modern languages ​​such as Python and Go reflects developers' pursuit of simplicity and efficiency. The return of functional programming and the emergence of new programming paradigms indicate that programming languages ​​are still innovating.
--------------------------------------------------------------------------------

Block 2:
The transformation of Web technology
The development of Internet technology began with simple HTML pages. In the Web 1.0 era, static web pages were the mainstream, and users could only passively receive information. Web 2.0 brought about an interactive revolution. AJAX technology made dynamic interaction possible, and social media and user-generated content changed the face of the Internet. In today's Web 3.0 era, semantic web technology and decentralized applications are reshaping cyberspace, and innovative technologies such as blockchain have brought new possibilities to the Web world.
--------------------------------------------------------------------------------

Block 3:
The development trend of cloud computing
Cloud computing technology has evolved from traditional hosts to virtualization and then to modern cloud services. The emergence of service models such as IaaS, PaaS, and SaaS has provided enterprises with flexible IT solutions. The popularity of container technology and microservice architecture has promoted the innovation of application deployment and expansion. The rise of edge computing has supplemented the shortcomings of traditional cloud computing, while the hybrid cloud strategy meets the dual needs of enterprises for flexibility and security. In the future, cloud native technology will continue to lead the direction of digital transformation.
--------------------------------------------------------------------------------

SemanticChunker's block results:

Block 1:
Three dimensions of technological development

The evolution of programming languages
Starting from the earliest machine language, programming languages ​​have gone through the development stages of assembly language and high-level language. In the 1950s, the emergence of early high-level languages ​​such as FORTRAN greatly improved programming efficiency. Subsequently, the object-oriented programming paradigm emerged, and languages ​​such as C++ and Java led the new direction of software development. In recent years, the popularity of modern languages ​​such as Python and Go reflects the developers' pursuit of simplicity and efficiency. The return of functional programming and the emergence of new programming paradigms indicate that programming languages ​​are still innovating. The road to change in Web technology
The development of Internet technology began with simple HTML pages. In the Web 1.0 era, static web pages were the mainstream and users could only passively receive information. Web 2.0 brought about an interactive revolution. AJAX technology made dynamic interaction possible. Social media and user-generated content changed the face of the Internet. In today's Web 3.0 era, semantic web technology and decentralized applications are reshaping cyberspace. Innovative technologies such as blockchain have brought new possibilities to the Web world. The development trend of cloud computing
Cloud computing technology has evolved from traditional hosts to virtualization and then to modern cloud services.
--------------------------------------------------------------------------------

Block 2:
The emergence of service models such as IaaS, PaaS, and SaaS has provided enterprises with flexible IT solutions. The popularity of container technology and microservice architecture has promoted the innovation of application deployment and expansion. The rise of edge computing has supplemented the shortcomings of traditional cloud computing, while the hybrid cloud strategy meets the dual needs of enterprises for flexibility and security. In the future, cloud native technology will continue to lead the direction of digital transformation. 
--------------------------------------------------------------------------------

Comparison of number of blocks:
RecursiveCharacterTextSplitter: 3 chunks
SemanticChunker: 2 chunks

5. Advantages and disadvantages analysis

Advantages:

  • Smarter chunking that preserves semantic integrity
  • Well suited to complex, long texts
  • Chunk boundaries align better with human understanding

Disadvantages:

  • Requires calls to an embedding model, so processing costs are higher
  • Relatively slow processing speed
  • Depends on external API services

Selection suggestions

  • If your project demands high accuracy of text understanding and the budget allows, SemanticChunker is recommended.
  • If your project needs to process large volumes of text quickly, or the text structure is fairly regular, RecursiveCharacterTextSplitter is a good choice.
  • In practice, choose the splitter that fits the needs of your specific scenario.

Summary

Text chunking is a key step in optimizing RAG performance.

SemanticChunker uses embedding models to understand semantics and intelligently preserve content coherence. It is well suited to long texts with complex, varied topics, but it is more costly and relies on external APIs.

RecursiveCharacterTextSplitter performs recursive segmentation based on characters and separators; it is fast, low-cost, and works offline. It suits structured text or scenarios that require fast processing, but it may break semantic units.

The choice of chunker depends on the project's specific requirements for accuracy, budget, processing speed, and text characteristics.

SemanticChunker provides a variety of threshold calculation methods (percentile, standard deviation, etc.); parameters should be tuned carefully to the text type to achieve the best results.

Understanding the principles and applicable scenarios of different blockers will help build a more efficient and accurate RAG system.