RAG Comprehensive Guide: How Late Chunking and Contextual Retrieval Solve the Context Problem

Explore RAG technology’s challenges and innovative solutions in context processing.
Core content:
1. The basic principles of RAG technology and its contextual challenges
2. Comparison of two strategies: Late Chunking and Contextual Retrieval
3. Analysis of the impact of the context loss problem on RAG performance
A RAG system splits the knowledge base into many small segments in advance, converts them into vectors, and stores them in a vector database. When a user asks a question, the system retrieves the text fragments most semantically similar to the query and attaches them to the prompt given to the generative model, improving the accuracy of the answer. This lets the model draw on external knowledge instead of relying solely on what it memorized during training, with all the omissions that entails. However, RAG faces a key challenge: context loss during chunking and retrieval. Traditional RAG often discards important context when encoding documents, so the retrieved fragments are not truly relevant. In other words, once documents are split and encoded, many semantic associations are severed; the system may fail to retrieve the right knowledge, which degrades the overall performance of RAG. To address this missing-context problem, two advanced strategies have recently been proposed, Late Chunking and Contextual Retrieval, which improve how vector embeddings represent document context from different angles.
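As a point of reference for the rest of this guide, here is a minimal sketch of the basic retrieve-then-generate loop just described. The helpers `embed`, `vector_store.search`, and `call_llm` are hypothetical placeholders for whatever embedding model, vector database, and generative model a given system actually uses.

```python
# A minimal sketch of the basic RAG loop: embed the query, fetch the most
# similar chunks from a vector store, and prepend them to the prompt.
# `embed`, `vector_store.search`, and `call_llm` are hypothetical helpers.
def answer_with_rag(question: str, vector_store, embed, call_llm, k: int = 5) -> str:
    query_vec = embed(question)
    chunks = vector_store.search(query_vec, top_k=k)   # top-k similar chunks
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```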
Document segmentation strategy: Early Chunking vs. Late Chunking
When RAG preprocesses the knowledge base, long documents must be split into smaller chunks before embedding, because many Transformer models have input length limits (such as 512 or 1024 tokens) and cannot encode a full, long document at once. The traditional approach is to split first and then embed, which we call "Early Chunking". However, this naive strategy brings serious problems:
- Context is lost: Once the document is divided into small segments, the embedding model processes each segment independently, and long-distance semantic associations are truncated. If a pronoun or abbreviation in one segment refers to a concept in another segment, the model usually cannot resolve the reference because it never sees the full text. For example, "it" in one paragraph may refer to "Berlin" mentioned in an earlier paragraph, but a model that embeds the paragraph in isolation has no way of knowing what "it" refers to. As a result, each fragment's vector captures only part of the fragment's meaning and misses the background provided by the rest of the document.
- Excessive information compression: Overly short fragments force the embedding to over-compress information, and details and nuances that span multiple fragments are lost. The model may map a fragment to an overly generic vector because the fine-grained information has been truncated away.
- Retrieval quality drops: Lacking global context, the fragment embeddings are not semantically consistent with one another, so vector retrieval cannot correctly judge which fragments are most relevant to the query. Retrieval ranking depends on embedding similarity; if the embeddings are inaccurate, relevant fragments may not score highly while unrelated fragments are retrieved by mistake. This directly weakens the RAG system's ability to return the correct answer.
Mathematically, suppose the ideal embedding of a complete document is $E(D)$, and that splitting it into fragments $D_i$ and embedding them yields $E(D_i)$. Naive chunking amounts to approximating the document's semantic representation by this set of local vectors. The approximation error grows as the chunking becomes finer, because the document's overall coherence cannot be reconstructed by simply stacking isolated fragments. Moreover, each fragment's embedding $E(D_i)$ is not the representation $E(D_i \mid D)$ the fragment would have in the context of the full document; a fragment vector computed apart from the full text often fails to capture the fragment's true meaning in the original document. The distortion is even more severe when a split falls in the middle of a sentence or cuts a conceptual chain that spans paragraphs. This loss of overall semantics and distortion of fragment semantics are the main challenges facing traditional naive chunking.
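Stated compactly in the notation above (a schematic restatement of the two failure modes, not a formal theorem):

```latex
% Naive chunking approximates the document representation by isolated chunk
% embeddings, and each isolated embedding differs from the representation the
% chunk would receive conditioned on the full document:
E(D) \not\approx \{\, E(D_1), E(D_2), \dots, E(D_n) \,\},
\qquad
E(D_i) \neq E(D_i \mid D).
```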
To solve these problems, the industry has proposed the Late Chunking strategy. As the name suggests, the idea of Late Chunking is "embed first, then chunk": preserve as much of the whole document's context as possible during the embedding stage, and only afterwards divide the embedding results into fragment vectors as needed. Concretely, this method usually involves the following steps:
- Whole-document encoding: Feed the complete document (or the longest possible span) into an embedding model that supports long contexts, and obtain an embedded representation for every token in the document. For example, Jina AI's latest embedding model accepts 8192 tokens at a time (roughly ten pages of text). In this step the model effectively "reads" the full text; every token vector attends to the rest of the document and therefore absorbs some global semantics.
- Delayed segmentation: After obtaining the token-level vectors for the whole article, partition them into groups according to predetermined boundaries (paragraph or sentence boundaries, or fixed lengths). The document is still chunked, but the chunking happens after embedding, not before. The boundaries can be chosen by rules or algorithms, such as LangChain's recursive character splitter, NLTK/spaCy sentence segmentation, or the segmentation tools provided by Jina. Either way, the key point is that the splitting operates on a vector sequence that already contains global context.
- Fragment pooling: For each resulting fragment, pool the token vectors it contains (usually mean pooling) to produce the fragment's final embedding. Because those token vectors were computed with the full text in view, each fragment's vector reflects not only the fragment's local meaning but also implicit information from other parts of the document. A minimal code sketch of this embed-then-pool procedure follows the list below.
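The sketch below illustrates the embed-then-pool idea, assuming a Hugging Face long-context embedding model (the Jina checkpoint named here is only an example) and chunk boundaries supplied as character offsets into the document:

```python
# A minimal late-chunking sketch: embed the whole document once, then mean-pool
# the token vectors that fall inside each chunk's character span.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-en"  # assumed long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunk_embed(document: str, char_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Return one embedding per chunk, each pooled from full-context token vectors."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]              # (num_tokens, 2) char offsets
    with torch.no_grad():
        token_vecs = model(**enc).last_hidden_state[0]  # (num_tokens, dim)

    chunk_vecs = []
    for start, end in char_spans:
        # keep tokens whose character span overlaps this chunk's span
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        chunk_vecs.append(token_vecs[mask].mean(dim=0))
    return torch.stack(chunk_vecs)
```

The only difference from naive chunking is the order of operations: the same boundaries could be produced by any splitter, but pooling happens over token vectors that have already attended to the entire document.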
Through Late Chunking, every chunk's embedding bears the "imprint" of the full text. The chunk vectors are no longer produced independently of one another; each is conditioned on the global context. This greatly reduces semantic loss: cross-fragment associations are captured by the model during the embedding stage and propagated into the vectors of every related fragment.
For example, consider the Wikipedia article about Berlin. Under the traditional method, a paragraph containing the phrase "the city" is embedded purely from the meaning of "the city" itself; the model cannot know which city is meant. With Late Chunking, because the model processes the whole article at once, the vector representation of "the city" incorporates the "Berlin" that appears earlier. The fragment vector obtained by pooling over this phrase naturally encodes the relationship "Berlin is that city" while still retaining the fragment's own local meaning. The resulting embedding is semantically more accurate than an isolated fragment embedding: it captures the local meaning and carries the document-level background.
Late Chunking processes the entire document in the embedding phase, so each fragment vector carries global context information (right), in contrast to the traditional method of embedding each fragment separately, which fragments the semantics (left). The left side of the figure shows naive chunking, splitting first and then embedding; the right side shows Late Chunking, embedding first and then splitting.
The foremost advantage of Late Chunking is context preservation. The full text is encoded in a single pass through the model and then simply split, yielding context-rich fragment vectors; the process is direct and efficient. Since every fragment "reads" the full text when its vector is generated, its semantics are no longer one-sided, which greatly reduces the risk of missing key information. Experiments show that Late Chunking significantly improves the accuracy of vector search: for the query "Berlin", a fragment that reads "the city is also one of the states..." and never mentions "Berlin" directly has a cosine similarity of only 0.708 with the query under naive chunking, but 0.825 under Late Chunking. In other words, after Late Chunking the angle between the fragment vector and the "Berlin" query vector is noticeably smaller, so the fragment is more likely to be retrieved as relevant. Late Chunking brings each fragment vector close to the ideal $E(D_i \mid D)$, mitigating the damage caused by context loss.
The above is the complete algorithm described in the paper; the important thing to understand is that it is a mechanism for injecting the context of the entire document into each chunk's vector.
It should be noted that Late Chunking is not without cost. It is constrained by the embedding model's context window: the model can only process a limited number of tokens at once, so a very long document (longer than the window) still has to be split. In other words, Late Chunking depends on long-context embedding models and works fully only when the document fits within the model's window. Most dedicated embedding models currently have limited context lengths; some large LLMs (such as GPT-4 or Claude) have larger context windows, but using an LLM directly to produce embeddings for vector search is neither economical nor optimal. For very long documents we therefore need other ways to inject context into fragments, and that is exactly the scenario where the Contextual Retrieval strategy shines.
Contextual Retrieval: letting each fragment "carry its own context"
When documents are far longer than a single model can process, or a long-context model is not available, another idea is to explicitly attach context information to each fragment. This is the core of the Contextual Retrieval strategy. Unlike Late Chunking, which works at the level of the model's mechanism, Contextual Retrieval does its work in the preprocessing stage: through a language model or other means, each document fragment is given an additional description that makes it "self-contextualized" before being embedded and indexed.
Anthropic proposed this method in 2024 and demonstrated notable results. Concretely, Contextual Retrieval comprises two key steps: Contextual Embeddings and Contextual BM25. First, for each document fragment, a capable large language model (such as Anthropic's Claude or OpenAI's GPT-4) generates a concise explanatory context based on the whole document and prefixes it to the fragment text. This context-augmented fragment is then embedded into a vector; in parallel, the same contextualized text is added to a keyword index (BM25) so that exact terms in a query can be matched.
For example, suppose the original fragment is "The company's revenue grew by 3% over the previous quarter." Looking at this sentence alone, we do not know which company "the company" refers to, nor which quarter is meant. For this fragment, Contextual Retrieval lets the model read the complete document (say, a financial report) and generate a brief description such as: "This fragment is from ACME's Q2 2023 financial report; revenue in the previous quarter was $314 million. The company's revenue grew by 3% over the previous quarter." The new fragment now includes background such as the company name and time period. We feed this context-augmented text into the embedding model to obtain a vector, and also add the full text to the BM25 index. As a result, whether a future user queries precise keywords such as "ACME 2023 Q2 revenue" or asks a vague but context-dependent question like "how did the company do this quarter", the index already holds vectors and keywords that can match the relevant fragment.
Schematic of the Contextual Retrieval preprocessing pipeline: for each document fragment, an LLM reads the entire document, generates a brief context, and attaches it to the fragment to form a "contextualized chunk", which is then encoded by the embedding model and stored in the vector database. At the same time, these contextualized texts are added to a keyword index (such as BM25) for exact matching. This ensures that even if a query uses different wording or only mentions contextual information, the relevant fragments can still be found.
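A minimal sketch of this preprocessing step is shown below. `call_llm` is a hypothetical helper standing in for any chat-completion API, and the prompt is a paraphrase of the idea described above, not Anthropic's exact wording:

```python
# A minimal sketch of contextual-embedding preprocessing: ask an LLM to write a
# short context for a chunk given the whole document, then prepend it.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical helper

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from the document above:
<chunk>
{chunk}
</chunk>
Write one or two sentences situating this chunk within the whole document,
to improve retrieval of the chunk. Answer with only that context."""

def contextualize_chunk(document: str, chunk: str) -> str:
    """Prepend an LLM-generated context sentence to the chunk before indexing."""
    context = call_llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context.strip()}\n{chunk}"

# The returned string is what gets embedded and added to the BM25 index,
# in place of the bare chunk text.
```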
This method essentially asks the large language model to act as an annotator: it writes a small context note for every isolated paragraph in the knowledge base, stating where it comes from and what key information it covers. The effect is striking: according to Anthropic's report, introducing Contextual Retrieval reduced the RAG system's retrieval failure rate by about 49%, and combining it with reranking and other techniques reduced the retrieval error rate by 67% overall. In other words, roughly half of the retrieval misses previously caused by missing background disappear. This improvement translates directly into better downstream Q&A: the model makes fewer wrong answers and its content is more accurate.
Why does a little "annotation" on each fragment help so much? From the perspective of the vector space, the reason is not mysterious: the context description adds discriminative information to the fragment embedding, pulling fragments that used to sit far from the query closer to it. Take the earlier pronoun example: if the query contains "Berlin" and the fragment itself only says "the city", the two vectors may be far apart under naive embedding, because the model does not know that "the city" is Berlin. With context generation, we prepend "This fragment is about the city of Berlin..." to the fragment, so its new vector moves toward the semantic position of "Berlin". This raises the cosine similarity between the fragment and the query and makes the fragment more likely to be selected. Likewise, adding BM25 catches matches the embedding model misses, such as codes and proper names in the query: even if the vector model handles them poorly, keyword retrieval compensates. Contextual Retrieval therefore combines semantic matching with exact matching to offset the weaknesses of pure vector retrieval in handling context. A minimal hybrid-retrieval sketch follows below.
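The sketch below shows one common way to fuse the two signals, using reciprocal rank fusion; it assumes `chunk_texts` are the contextualized chunks, `chunk_vecs` their embeddings (e.g. produced by the steps above), and `embed_query` a hypothetical helper that returns a query vector.

```python
# A minimal sketch of hybrid retrieval: dense cosine scores fused with BM25
# keyword scores via reciprocal rank fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, chunk_texts, chunk_vecs, embed_query, k=5, rrf_k=60):
    # dense ranking: cosine similarity between the query and chunk embeddings
    q = embed_query(query)
    dense_scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    dense_rank = np.argsort(-dense_scores)

    # sparse ranking: BM25 over whitespace-tokenized contextualized chunks
    bm25 = BM25Okapi([t.lower().split() for t in chunk_texts])
    sparse_rank = np.argsort(-bm25.get_scores(query.lower().split()))

    # reciprocal rank fusion: sum of 1 / (rrf_k + rank) over both rankings
    fused = np.zeros(len(chunk_texts))
    for ranking in (dense_rank, sparse_rank):
        for rank, idx in enumerate(ranking):
            fused[idx] += 1.0 / (rrf_k + rank + 1)
    return [chunk_texts[i] for i in np.argsort(-fused)[:k]]
```

In a production system a learned reranker would typically replace or follow the RRF step, which is what the 67% figure above refers to.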
It should be pointed out that although Contextual Retrieval is effective, its overhead is larger than Late Chunking's: an LLM must be called over the whole knowledge base to generate contexts, which is costly and time-consuming for large document sets. Some engineering techniques can reduce the cost, such as prompt caching. Anthropic exploits a feature of its Claude models: the entire document can be loaded into the model's cache and reused for every fragment, so the full text does not have to be re-sent each time. Reportedly, with caching the cost of batch context generation can drop to roughly 10% of the original. Fragments can also be processed in parallel, or the model can be asked to generate contexts for several fragments at once (although writing summaries for too many fragments in one pass tends to reduce quality, which is why Anthropic generates them one by one). In short, Contextual Retrieval is computationally heavier than plain RAG, but for long documents (beyond model length) or complex contexts it offers a practical way to keep context consistent, and in practice it has proven to capture the right information more reliably.
The challenge of multi-turn retrieval: incorporating the dialogue context
Whether for Late Chunking or Contextual Retrieval, most current discussion targets single-turn Q&A or standalone queries. In practice, however, users often ask questions over multiple turns of conversation, and there RAG faces a new challenge: how should retrieval use the dialogue context? For example, the user first asks: "In what year was Berlin first mentioned as the capital of Germany?" The system retrieves the relevant information and answers. The user then asks: "How large is its population?" The "it" here obviously still refers to Berlin. If the retrieval module does not realize this and searches the vector database with the query "its population" as-is, it will very likely find nothing, because no fragment contains a sentence like "its population". This means the retrieval side of a RAG system needs dialogue memory: it must understand elided references in follow-up queries and actively fold the previously mentioned topic (Berlin) into the new query vector or search keywords.
The current common practice for multi-turn retrieval is to incorporate the dialogue history: for example, concatenate the user's previous questions or the system's answers with the current query and feed them into the embedding model to produce a context-aware query vector; or rewrite the user's question so that "it" is replaced by the concrete referent (Berlin) before searching. There are also multi-hop retrieval strategies: to answer a complex question the system may search several times, each time refining the next query with the results of the previous step. In these multi-turn, multi-step interactions, Late Chunking and Contextual Retrieval need corresponding adjustments. Late Chunking can use a long-context model to embed the dialogue history together with the new question, so that the query embedding itself carries the dialogue context; Contextual Retrieval can generate context for the dialogue, for example by using an LLM to summarize the important information mentioned so far and attaching it to the current query's retrieval step. In general, multi-turn scenarios reinforce the principle that "context is king": not only do documents need context, the query itself has context too. Making the retrieval module fully understand the dialogue context remains a difficult problem for RAG applications. A minimal query-rewriting sketch follows below.
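The sketch below shows the rewriting variant: condensing the dialogue into a standalone query so pronouns are resolved before embedding. `call_llm` is again a hypothetical chat-completion helper, and the prompt is illustrative only.

```python
# A minimal sketch of condensing a multi-turn dialogue into a standalone
# retrieval query, so references like "it" are resolved before embedding.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical helper

REWRITE_PROMPT = """Given the conversation history and the latest user question,
rewrite the question as a standalone query that names all entities explicitly.

History:
{history}

Latest question: {question}

Standalone query:"""

def rewrite_query(history: list[str], question: str) -> str:
    """Resolve elided references against the dialogue history."""
    return call_llm(REWRITE_PROMPT.format(history="\n".join(history),
                                          question=question)).strip()

# Example: if the history mentions Berlin and the new question is
# "How large is its population?", the rewritten query should read roughly
# "How large is the population of Berlin?", which is then embedded and searched.
```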
The dilemma of multimodal documents: semantic fusion of text and images
RAG systems were originally built mainly for text, but real-world knowledge bases are often multimodal: they contain text, images, tables, audio, video, and other forms of data. For example, an enterprise's internal knowledge base may include PDF documents (rich in textual descriptions and data tables), product photos, design drawings, and even video tutorials. To build a multimodal RAG system, the traditional text-retrieval scheme must be extended to understand and retrieve information across modalities. This raises two main challenges:
First, each modality has its own difficulties in embedding and understanding. Text can be embedded with language models, but what about images? A common practice is to use a pre-trained image embedding model (such as CLIP) to convert images into vectors. However, image semantics often differ from textual descriptions, and aligning the image embedding space with the text embedding space so the two can be compared is a hard problem. One approach is to train a multimodal joint embedding model that accepts both text and image input and maps them into the same vector space; but training such a model requires large amounts of cross-modal data, and information loss is likely, since representing image and text content in a single vector inevitably compresses details. Another approach is to embed each modality separately: text with text models, images with image models, each with its own vector index. But then a new problem appears at retrieval time: when a user's question may involve multimodal information (such as "show me the sales trend chart in this report"), how does the system compare text fragments against image vectors and decide which results to return? Usually a reranking or fusion step is needed to pool the results from different modalities and let a learned model or heuristic rule decide which are more relevant. A sketch of the shared-space approach follows below.
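As an illustration of the joint-embedding route, the sketch below maps a text query and an image into the shared space of a pretrained CLIP model and compares them directly; the checkpoint name is one commonly available example, not a requirement.

```python
# A minimal sketch of comparing a text query against an image in CLIP's shared
# text-image embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(query: str, image_path: str) -> float:
    """Cosine similarity between a text query and an image in CLIP space."""
    with torch.no_grad():
        text_inputs = processor(text=[query], return_tensors="pt", padding=True)
        text_vec = model.get_text_features(**text_inputs)
        image_inputs = processor(images=Image.open(image_path), return_tensors="pt")
        image_vec = model.get_image_features(**image_inputs)
    return torch.nn.functional.cosine_similarity(text_vec, image_vec).item()
```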
Second, contextual relationships across modalities are hard to preserve. Imagine a document with text and an illustration: the text says "Sales are on the rise, as shown in the figure below", and the figure is a line chart. If text and image are embedded separately as independent retrieval units, the text fragment "sales are on the rise" never says which figure or which product it refers to, and the image vector, while encoding the upward trend, knows nothing about the text describing it. At retrieval time both the text and the picture may be found, but the generation model struggles to associate the two automatically (especially when the figure and the text are not returned in the same retrieval window). Closing this semantic gap requires a multimodal context-fusion strategy. One practical approach is to chunk the document together with its figures: convert each image into explanatory text (via an image captioning model), merge it with the surrounding text into a mixed fragment, and embed the result as a whole. This echoes the idea of Contextual Retrieval: add textual context to the image so its meaning is explicit before it participates in retrieval. Tools such as LangChain also support combining modalities at the answer stage: retrieve the relevant text passages and images separately, convert the images into textual descriptions, and hand everything to the LLM to generate the final answer. Either way, retrieval and context understanding for multimodal RAG are still at an early, exploratory stage, and dedicated context-preservation methods need to be designed for each modality. For example, some researchers segment videos, embed their subtitles, and generate a scene description for each segment to help retrieve content across shots. Predictably, multimodal documents place new demands on Late Chunking and Contextual Retrieval: we must preserve context not only within a single modality, but also establish contextual associations between modalities. Challenges and solutions in this area are still emerging and maturing. A minimal captioning-and-merging sketch follows below.
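The sketch below illustrates the caption-then-merge idea: the figure is turned into a caption and fused with its surrounding paragraph into one chunk, so the chart's meaning becomes retrievable through ordinary text search. It assumes the transformers image-to-text pipeline with a BLIP captioning checkpoint.

```python
# A minimal sketch of "caption then merge" for mixed text/image documents.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def build_multimodal_chunk(paragraph: str, image_path: str) -> str:
    """Fuse a paragraph and the caption of its accompanying figure into one chunk."""
    caption = captioner(image_path)[0]["generated_text"]
    return f"{paragraph}\n[Figure] {caption}"

# The resulting chunk (e.g. "Sales are on the rise... [Figure] a line chart ...")
# is then embedded with the ordinary text embedding model and indexed as usual.
```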
Closing thoughts:
To truly achieve "comprehension and integration", RAG systems still need to tackle multi-turn dialogue and multimodal retrieval. The model must remember not only the context of the documents themselves, but also the context of the interaction between humans and AI, and it must understand the relationships between different forms of information. Perhaps in the near future we will see an extended Late Chunking that handles multimodal inputs, or a new Contextual Retrieval that builds dynamic indexes for conversation contexts in real time. What is certain is that as context-acquisition methods keep improving, RAG systems will become increasingly intelligent, truly able to find the needle in the haystack of massive information and provide users with answers that are both accurate and rich.