Embedding-based segmentation: Text Splitting, an important part of RAG

SemanticChunker: a new approach to text splitting that improves the efficiency and accuracy of RAG applications.
Core content:
1. The main application scenarios and advantages of SemanticChunker
2. Core principle: Text segmentation strategy based on semantic coherence
3. Practical application: parameter settings and usage examples
SemanticChunker is a text chunking strategy that aims to maintain semantic coherence. It understands text semantics through an embedding model and can better maintain paragraph integrity compared to traditional character segmentation methods. This article introduces its principles, usage scenarios, advantages and disadvantages, and application suggestions in detail to help developers choose the most appropriate text chunking strategy and improve the efficiency and accuracy of RAG applications.
Contents:
1. The main purpose of SemanticChunker
2. Core Principles
3. Flowchart
4. Use of SemanticChunker
 1. Detailed explanation of initialization parameters
 2. Comparison of breakpoint threshold types
 3. Complete usage example
5. Advantages and disadvantages analysis
Summary
Among the core steps of RAG there is one crucial step: text splitting.
Its job is to divide a large text into smaller, more manageable segments so that the model can better understand, process, or store the content.
If an entire article is embedded as a single unit, the granularity is too coarse and the model is more likely to make mistakes when answering questions. Whether the split is done well therefore directly affects the relevance and accuracy of the final answer.
The LangChain framework provides a variety of text splitters, including RecursiveCharacterTextSplitter, which is widely used because its recursive, character-and-delimiter-based strategy is fast and consumes few resources.
However, this purely structure-based chunking approach can sometimes cut through semantically complete paragraphs.
To solve this problem, you can use SemanticChunker, which relies on an embedding model for semantic understanding and places boundaries according to the semantic relevance of the content, thereby better preserving the semantic integrity of the text.
1. The main purpose of SemanticChunker
SemanticChunker is a text splitter based on semantic similarity. Its core advantage is that it can intelligently identify semantic boundaries in a text. It is mainly suited to the following scenarios:
Maintaining semantic coherence: when a long text must be split into several segments, it avoids breaking the natural flow of the content, so that segments are neither overly fragmented nor cut off from their context.
Optimizing RAG retrieval: as a preprocessing step before embedding or vector retrieval, it produces semantically complete chunks that improve the accuracy of downstream retrieval and generation.
Detecting topic changes: it automatically finds places where the semantics are discontinuous (semantic "breakpoints"), such as the start of a new topic or a clear shift in context.
Going beyond simple cutting: unlike methods based on a fixed number of characters or on separators, it determines split points dynamically from semantic "jumps" in the content, producing a segmentation closer to how a human reads.
2. Core Principles
The splitting process works roughly as follows:
1. Split the text into sentences using a regular expression (by default on . ? !).
2. Combine each sentence with its neighboring sentences (controlled by the buffer_size parameter).
3. Embed each sentence group with the supplied Embeddings model.
4. Compute the semantic difference between adjacent groups, specifically the cosine distance.
5. Split at the positions where this difference "jumps": a particularly large distance suggests the topic or content has shifted, so a breakpoint is placed there.
The threshold for what counts as a "jump" can be computed in several ways:
Percentile (percentile)
Mean + standard deviation (standard_deviation)
Mean + interquartile range (interquartile)
Gradient percentile (gradient)
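To make step 5 concrete, here is a minimal sketch of the percentile strategy. It is an illustration, not the library's actual implementation; the embed function is a placeholder for whatever embedding model is passed in.

import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def percentile_breakpoints(sentence_groups, embed, percentile=95):
    # Sketch of the "percentile" strategy: embed each sentence group,
    # measure the cosine distance between adjacent groups, and mark a
    # breakpoint wherever the distance exceeds the chosen percentile.
    # `embed` is a placeholder for any function that maps a list of
    # texts to a list of vectors (e.g. a model's embed_documents).
    vectors = embed(sentence_groups)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = np.percentile(distances, percentile)
    # Indices after which a new chunk should start
    return [i for i, d in enumerate(distances) if d > threshold]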
Related reading:
What is Embedding in Artificial Intelligence?
From a beginner to an expert in artificial intelligence: A simple understanding of cosine similarity
3. Flowchart
(Flowchart: sentence splitting → sentence grouping → embedding → cosine-distance calculation → breakpoint detection → final chunks.)
4. Use of SemanticChunker
1. Detailed explanation of initialization parameters
from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings,                                # Required: embedding model instance
    buffer_size=1,                             # Optional: how many neighboring sentences to combine
    add_start_index=False,                     # Optional: whether to add the start index to metadata
    breakpoint_threshold_type="percentile",    # Breakpoint threshold type
    breakpoint_threshold_amount=None,          # Breakpoint threshold value
    number_of_chunks=None,                     # Optional: desired number of chunks
    sentence_split_regex=r"(?<=[.?!])\s+",     # Sentence-splitting regular expression (default)
    min_chunk_size=None,                       # Optional: minimum chunk size
)
Note: The default sentence-splitting regular expression does not handle Chinese text well, since Chinese usually has no whitespace after sentence-ending punctuation, so it needs to be customized, for example: sentence_split_regex=r"(?<=[.?!])\s*"
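As a further illustration, a regex that also recognizes full-width Chinese sentence-ending punctuation might look like the following sketch; the exact character set is an assumption and should be adapted to the corpus.

# Sketch: split after Chinese or English sentence-ending punctuation,
# even when no whitespace follows (common in Chinese text)
semantic_splitter = SemanticChunker(
    embeddings,
    sentence_split_regex=r"(?<=[。！？.?!])\s*",
)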
2. Comparison of breakpoint threshold types
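The four threshold types listed under Core Principles differ only in how the cut-off for a semantic "jump" is computed. Below is a minimal configuration sketch of each; the breakpoint_threshold_amount values are illustrative, not authoritative defaults, and `embeddings` is assumed to be an embedding model instance as above.

from langchain_experimental.text_splitter import SemanticChunker

# percentile: split where the distance exceeds the N-th percentile of all distances
splitter_percentile = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

# standard_deviation: split where the distance exceeds mean + N standard deviations
splitter_stddev = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3,
)

# interquartile: split where the distance exceeds mean + N * interquartile range
splitter_iqr = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5,
)

# gradient: roughly, apply the percentile rule to the gradient of the distances,
# which can help when the distances themselves are tightly clustered
splitter_gradient = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=95,
)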
3. Complete usage example
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Create an OpenAI embeddings instance
embeddings = OpenAIEmbeddings(openai_api_key="hk-iwtbie91e427",
                              model="text-embedding-3-large",
                              base_url="https://api.openai-hk.com/v1")

# Read the sample text
with open('example_text_2.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Create the two splitters
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    separators=["\n\n", "\n", ".", ",", " ", ""]
)
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    sentence_split_regex=r"(?<=[.?!])\s*",
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

# Split the text with the recursive splitter
print("RecursiveCharacterTextSplitter's chunking results:")
recursive_chunks = recursive_splitter.split_text(text)
for i, chunk in enumerate(recursive_chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk)
    print("-" * 80)

# Split the text with the semantic splitter
print("\nSemanticChunker's chunking results:")
semantic_chunks = semantic_splitter.split_text(text)
for i, chunk in enumerate(semantic_chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk)
    print("-" * 80)

# Compare the number of chunks
print("\nComparison of chunk counts:")
print(f"RecursiveCharacterTextSplitter: {len(recursive_chunks)} chunks")
print(f"SemanticChunker: {len(semantic_chunks)} chunks")
Chunking results:
RecursiveCharacterTextSplitter's chunking results:
Chunk 1:
Three dimensions of technological development
The evolution of programming languages
Starting from the earliest machine language, programming languages have gone through the development stages of assembly language and high-level language. In the 1950s, the emergence of early high-level languages such as FORTRAN greatly improved programming efficiency. Subsequently, the object-oriented programming paradigm emerged, and languages such as C++ and Java led the new direction of software development. In recent years, the popularity of modern languages such as Python and Go reflects developers' pursuit of simplicity and efficiency. The return of functional programming and the emergence of new programming paradigms indicate that programming languages are still innovating.
--------------------------------------------------------------------------------
Chunk 2:
The transformation of Web technology
The development of Internet technology began with simple HTML pages. In the Web 1.0 era, static web pages were the mainstream, and users could only passively receive information. Web 2.0 brought about an interactive revolution. AJAX technology made dynamic interaction possible, and social media and user-generated content changed the face of the Internet. In today's Web 3.0 era, semantic web technology and decentralized applications are reshaping cyberspace, and innovative technologies such as blockchain have brought new possibilities to the Web world.
--------------------------------------------------------------------------------
Chunk 3:
The development trend of cloud computing
Cloud computing technology has evolved from traditional hosts to virtualization and then to modern cloud services. The emergence of service models such as IaaS, PaaS, and SaaS has provided enterprises with flexible IT solutions. The popularity of container technology and microservice architecture has promoted the innovation of application deployment and expansion. The rise of edge computing has supplemented the shortcomings of traditional cloud computing, while the hybrid cloud strategy meets the dual needs of enterprises for flexibility and security. In the future, cloud native technology will continue to lead the direction of digital transformation.
--------------------------------------------------------------------------------
SemanticChunker's chunking results:
Chunk 1:
Three dimensions of technological development
The evolution of programming languages
Starting from the earliest machine language, programming languages have gone through the development stages of assembly language and high-level language. In the 1950s, the emergence of early high-level languages such as FORTRAN greatly improved programming efficiency. Subsequently, the object-oriented programming paradigm emerged, and languages such as C++ and Java led the new direction of software development. In recent years, the popularity of modern languages such as Python and Go reflects the developers' pursuit of simplicity and efficiency. The return of functional programming and the emergence of new programming paradigms indicate that programming languages are still innovating. The road to change in Web technology
The development of Internet technology began with simple HTML pages. In the Web 1.0 era, static web pages were the mainstream and users could only passively receive information. Web 2.0 brought about an interactive revolution. AJAX technology made dynamic interaction possible. Social media and user-generated content changed the face of the Internet. In today's Web 3.0 era, semantic web technology and decentralized applications are reshaping cyberspace. Innovative technologies such as blockchain have brought new possibilities to the Web world. The development trend of cloud computing
Cloud computing technology has evolved from traditional hosts to virtualization and then to modern cloud services.
--------------------------------------------------------------------------------
Chunk 2:
The emergence of service models such as IaaS, PaaS, and SaaS has provided enterprises with flexible IT solutions. The popularity of container technology and microservice architecture has promoted the innovation of application deployment and expansion. The rise of edge computing has supplemented the shortcomings of traditional cloud computing, while the hybrid cloud strategy meets the dual needs of enterprises for flexibility and security. In the future, cloud native technology will continue to lead the direction of digital transformation.
--------------------------------------------------------------------------------
Comparison of chunk counts:
RecursiveCharacterTextSplitter: 3 chunks
SemanticChunker: 2 chunks
5. Advantages and disadvantages analysis
Advantages:
Smarter chunking that preserves semantic integrity
Well suited to complex, long texts
Chunk boundaries that align better with human understanding
Disadvantages:
Requires calls to an embedding model, so processing cost is higher
Relatively slow processing speed
Depends on an external API service
Selection suggestions
If the project demands highly accurate text understanding and the budget allows, SemanticChunker is recommended.
If the project needs to process a large amount of text quickly, or the text structure is relatively regular, RecursiveCharacterTextSplitter is a good choice.
In practice, choose the splitter that fits the needs of the specific scenario.
Summary
Text chunking is a key step in optimizing RAG performance.
SemanticChunker uses an embedding model to understand semantics and intelligently preserve content coherence. It is well suited to long texts with complex, varied topics, but it costs more and depends on external APIs.
RecursiveCharacterTextSplitter splits recursively by characters and separators; it is fast, cheap, and works offline. It suits structured text or scenarios that need fast processing, but it may break semantic units.
The choice of chunker depends on the project's specific requirements for accuracy, budget, processing speed, and text characteristics.
SemanticChunker provides several threshold calculation methods (percentile, standard deviation, etc.), and the parameters need to be tuned to the text type to get the best results.
Understanding the principles and applicable scenarios of the different splitters helps in building a more efficient and accurate RAG system.