Semantic Chunking in RAG: Towards Better Contextual Retrieval

Written by
Caleb Hayes
Updated on: June 24, 2025
Recommendation

How does RAG technology optimize contextual retrieval through semantic chunking to achieve more accurate information responses?

Core content:
1. The importance of RAG technology combining large language models with external knowledge retrieval
2. The role of chunking technology in RAG and its impact on model performance
3. Key steps and methods for implementing semantic chunking

Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Professional (TVP)

Retrieval-augmented generation (RAG) has emerged as a new force in the field of large language models. By combining the power of large language models (LLMs) with external knowledge retrieval, RAG enables models to generate accurate, well-grounded responses, even in specialized fields. Behind every outstanding RAG pipeline, a quiet hero plays a key role: chunking, and in particular semantic chunking.

The RAG Ecosystem and the Role of Chunking

RAG represents a major change in how AI systems acquire and use knowledge. Traditional large language models rely only on their pre-trained knowledge, which may be limited or outdated. RAG addresses this limitation by retrieving relevant information from external resources (such as databases, documents, or the Internet) during generation. This external knowledge acts like supplementary ammunition, greatly expanding the model's knowledge boundary and enabling it to handle a wide range of complex problems.

Chunking is a crucial step in the RAG pipeline. It refers to dividing a document into smaller units before embedding and indexing; these chunks are retrieved at query time and fed into the large language model to generate responses. Chunking is not a simple cutting operation, however, and the method chosen directly affects the performance of the RAG system. If chunks are too large, they may not fit into the model's context window, causing information loss; if chunks are too small or poorly segmented, semantic information is destroyed, making it hard for the model to understand and process them, which in turn degrades the quality of the final response.

Challenges of Chunking

For example, suppose a document reads: "Batman primarily operates in Gotham City, a crime-ridden, corruption-ridden metropolis. His arch-nemesis, the Joker, thrives in chaos and unpredictability. Although Bruce Wayne funds many social programs in Gotham City, he struggles with his dual identities as a billionaire and a vigilante." With a simple fixed-size chunking approach, we might split it into:

  • Chunk 1: "Batman primarily operates in Gotham City, a crime-ridden, corruption-"
  • Chunk 2: "ridden metropolis. His arch-nemesis, the Joker, thrives"
  • Chunk 3: "in chaos and unpredictability. Although Bruce Wayne funds"
  • Chunk 4: "many social programs in Gotham City, he struggles with..."

At this point, if the user asks, "What makes Batman's life so contradictory?", the retriever may return a chunk that starts or ends mid-sentence, or miss the key information about his dual identity entirely, producing a vague or wrong answer. This clearly illustrates the problem of inappropriate chunking and underscores why semantic chunking matters.
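
To see the failure mode concretely, here is a minimal sketch, in plain Python with no libraries, that reproduces this kind of mid-sentence splitting with naive fixed-size chunking; the chunk size is illustrative only:

    # Naive fixed-size chunking: slice the text every `chunk_size` characters,
    # with no awareness of sentence or topic boundaries.
    text = (
        "Batman primarily operates in Gotham City, a crime-ridden, "
        "corruption-ridden metropolis. His arch-nemesis, the Joker, thrives "
        "in chaos and unpredictability. Although Bruce Wayne funds many "
        "social programs in Gotham City, he struggles with his dual "
        "identities as a billionaire and a vigilante."
    )
    chunk_size = 60  # characters; kept small on purpose so the mid-sentence cuts are visible
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for n, c in enumerate(chunks, 1):
        print(f"Chunk {n}: {c!r}")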


Semantic Chunking in Detail

Semantic chunking aims to segment documents in a way that preserves the meaningful, self-contained context of each unit. It respects natural boundaries, such as paragraphs, sentences, or topics, ensuring that each chunk can independently answer relevant queries. Implementing semantic chunking usually involves the following key steps:

  1. Sentence boundary detection
    Accurately identifying the start and end of a sentence is the basis for preserving semantic integrity. Because a sentence is the basic language unit that expresses a complete thought, correctly demarcating sentence boundaries helps group related information together.
  2. Topic modeling or embedding-based segmentation
    Topic modeling analyzes the document's content and groups passages that share a topic into one chunk. Embedding-based segmentation uses word or sentence embedding vectors, computes the similarity between adjacent vectors, and splits where the semantics shift, making each chunk internally more coherent.
  3. Preserving context with overlapping windows
    To avoid losing context at chunk boundaries, adjacent chunks usually share a certain proportion of overlapping content. This keeps neighboring information correlated when chunks are retrieved and processed, strengthening the model's grasp of context (see the sketch after this list).
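
As a concrete illustration of steps 1 and 3, here is a minimal sketch that uses NLTK for sentence boundary detection and a simple sliding window for overlap; the window and overlap sizes are illustrative, and the overlap is assumed to be smaller than the window:

    import nltk

    nltk.download('punkt', quiet=True)  # NLTK's sentence-boundary model

    def chunk_with_overlap(document, sentences_per_chunk=5, overlap=1):
        # Step 1: detect sentence boundaries.
        sentences = nltk.sent_tokenize(document)
        # Step 3: slide a window so adjacent chunks share `overlap` sentences.
        step = sentences_per_chunk - overlap
        return [
            " ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), step)
        ]

    # Example: 5-sentence chunks, each sharing 1 sentence with its neighbor.
    chunks = chunk_with_overlap(document, sentences_per_chunk=5, overlap=1)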

Comparison of Chunking Strategies

There are several common chunking strategies, introduced below in order from simple to highly semantic:

  1. Fixed-size chunking (simple method)
    In Python's LangChain library, you can use CharacterTextSplitter to perform fixed-size chunking. Sample code:
    from langchain.text_splitter import CharacterTextSplitter
    splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(document)
    This method is simple, direct, and easy to implement, but it has an obvious defect: it may split a sentence in the middle, destroying sentence integrity and context coherence and degrading the expression of semantics.
  2. Sentence-based chunking
    Sentence-based chunking can be implemented with NLTKTextSplitter. Sample code:
    from langchain.text_splitter import NLTKTextSplitter
    splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=50)  # sizes are in characters
    chunks = splitter.split_text(document)
    This method preserves sentence boundaries and maintains semantic integrity to a degree. However, it may still split a topic across multiple chunks, which hinders the model's understanding and processing of the complete topic.
  3. Recursive chunking
    RecursiveCharacterTextSplitter provides recursive splitting. Sample code:
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    splitter = RecursiveCharacterTextSplitter(
     separators=["\n\n", "\n", ".", " ", ""],
     chunk_size=500,
     chunk_overlap=100
    )
    chunks = splitter.split_text(document)
    Recursive chunking tries progressively smaller separators (paragraphs, then sentences, then words), striking a good balance between chunk length and semantic preservation. It may still need tuning for the specific application scenario to achieve the best results.
  4. Embedding-based semantic chunking (advanced approach)
    This technique uses sentence embeddings to split the text where the semantics change. Sample code:
    from sentence_transformers import SentenceTransformer, util
    import nltk

    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(document)
    embeddings = model.encode(sentences)
    similarities = [util.cos_sim(embeddings[i], embeddings[i+1])
                    for i in range(len(embeddings) - 1)]

    chunks = []
    chunk = [sentences[0]]
    for i, score in enumerate(similarities):
        if score < 0.6:  # the threshold can be adjusted as needed
            chunks.append(" ".join(chunk))
            chunk = []
        chunk.append(sentences[i+1])
    if chunk:
        chunks.append(" ".join(chunk))

    Embedding-based semantic chunking achieves genuinely semantic-level segmentation and works well for documents that span many topics. However, it has higher computational cost, slower processing, and a more involved implementation.

Evaluating Chunk Quality

The chunking strategy directly affects every downstream stage of a RAG system, so evaluating chunk quality is essential. It can be assessed along the following dimensions:

  1. Metrics
  • Chunk overlap with ground truth (e.g., the Recall@k metric)
    Segmentation accuracy is measured by the overlap between the produced chunks and an ideal (ground-truth) segmentation: the higher the overlap, the closer the result is to the ideal and the better it retains relevant information.
  • Embedding consistency (similarity within a chunk should be high)
    Evaluate the similarity between the embedding vectors of the text inside a chunk. High intra-chunk similarity indicates good semantic coherence, making the chunk easier for the model to understand and process. (Both of these metrics are sketched in code after the tool list below.)
  • Model answer accuracy (end-to-end RAG evaluation)
    Feed in real queries and observe the accuracy of the answers the model generates from the retrieved chunks. This is the most direct indicator of how the chunking strategy affects overall RAG performance.
  2. Tools
  • LangChain RAG Evaluator
    The evaluator provided by the LangChain library makes it easy to evaluate a RAG system, including the effect of chunking.
  • Ragas
    A toolkit designed specifically for evaluating RAG systems; it can analyze chunk quality along multiple dimensions.
  • Customized question-answer pairs with ground-truth relevance labels
    By creating customized question-answer pairs and labeling the relevance between questions and answers, you can evaluate how a chunking strategy performs on specific tasks in a targeted way.
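
As a rough illustration, here is a minimal sketch of the first two metrics. It assumes `retrieved` is the ranked list of chunk ids a retriever returned for a query and `relevant` is the ground-truth set of relevant chunk ids; the model name is only an example:

    from sentence_transformers import SentenceTransformer, util
    import nltk

    def recall_at_k(retrieved, relevant, k):
        # Fraction of ground-truth chunks that appear in the top-k results.
        return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

    model = SentenceTransformer('all-MiniLM-L6-v2')

    def embedding_consistency(chunk_text):
        # Mean pairwise cosine similarity between the sentences of one chunk;
        # higher values suggest the chunk is semantically coherent.
        sentences = nltk.sent_tokenize(chunk_text)
        if len(sentences) < 2:
            return 1.0
        emb = model.encode(sentences, convert_to_tensor=True)
        sims = util.cos_sim(emb, emb)  # full pairwise similarity matrix
        n = len(sentences)
        # Average the off-diagonal entries (exclude each sentence's
        # similarity with itself).
        return ((sims.sum() - sims.trace()) / (n * (n - 1))).item()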

Best Practices

To achieve effective semantic chunking, follow these best practices:

  1. Prefer sentence-based or semantic-aware chunking
    This preserves semantic information better and improves the model's ability to understand context.
  2. Use chunk overlap judiciously
    Typically, an overlap of 50-100 tokens is appropriate. Overlap keeps information coherent across adjacent chunks and avoids context loss at split points.
  3. Adjust chunk size to the application scenario
    Different document types have different chunk-size requirements. Legal documents are usually complex and information-dense, so they may need larger chunks; tweets are short, so the chunk size should be smaller.
  4. Hierarchy-aware chunking using metadata (e.g., titles, subtitles)
    Metadata provides structural information about the document, helping the chunker respect the document's hierarchy so that the results are more logical (see the sketch after this list).
  5. Continuously evaluate, iterate, and retrain the retriever
    As data changes and application scenarios shift, the chunking strategy may need ongoing optimization. Continuously evaluating chunk quality and iterating on and retraining the retriever keeps the RAG system performing well.
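
For practice 4, one possible implementation, assuming the source documents are Markdown and with illustrative header names, is LangChain's MarkdownHeaderTextSplitter, which attaches each chunk's section headers as metadata before size-based splitting:

    from langchain.text_splitter import (
        MarkdownHeaderTextSplitter,
        RecursiveCharacterTextSplitter,
    )

    # Split on document structure first, so every chunk carries its headers
    # as metadata.
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "title"), ("##", "section")]
    )
    sections = header_splitter.split_text(markdown_document)  # `markdown_document` is assumed

    # Then apply size-based splitting within each section; the header
    # metadata is preserved on the resulting chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = splitter.split_documents(sections)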

The Real-World Impact of Semantic Chunking

Semantic chunking is crucial to real-world RAG systems and can even determine whether a system succeeds or fails. Consider an enterprise application, a legal-contract question-answering bot: after switching from simple chunking to recursive + semantic chunking, it achieved significant results:

  1. Answer accuracy increased by 23%
    Semantic chunking lets the bot understand the context of a question more precisely, retrieve more relevant passages from the contract, and generate more accurate answers.
  2. Hallucinations reduced by 41%
    Hallucination, where the model generates plausible-sounding but false information, is a common problem in generative models. Semantic chunking reduces it effectively by providing more accurate context.
  3. Retrieval hit rate increased from 62% to 87%
    Semantic chunking optimizes the content and structure of chunks, enabling the retriever to match user queries more accurately and greatly improving the hit rate.

Semantic chunking is an indispensable link in RAG technology. By optimizing how documents are segmented, it improves contextual retrieval and thereby significantly improves the performance of the RAG system. As artificial intelligence continues to develop, semantic chunking will keep evolving and improving, providing strong support for applications in more and more fields. Whether you are developing an internal knowledge bot or building a domain-specific intelligent assistant, a deep understanding and careful application of semantic chunking will bring a major advantage and push AI applications toward greater intelligence and efficiency.