Using RAG technology to build an enterprise-level document question-answering system: Segmentation (5) Late Chunking

Explore Jina's latest text segmentation technology to improve the accuracy of enterprise document question-answering systems.
Core content:
1. Background and motivation of Jina's Late Chunking technology
2. The principle of Late Chunking technology and its difference from existing methods
3. Implementation and effect analysis of Late Chunking in Chinese environment
Overview
Jina introduced a new text segmentation method in 2024, which was systematically explained in the paper Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models.
Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/
This article is probably the first to implement Late Chunking in Chinese. It introduces the motivation and principle of the work as simply as possible, presents a Chinese implementation, and analyzes the results.
Motivation
Pronouns are very common in ordinary writing. Jina points out that chunking splits text into segments, and some segments contain only pronouns without the objects they refer to, which can cause retrieval to fail.
For example, the document below is chunked as shown on the right of the figure. The user asks, "How many residents are there in Berlin?" The second chunk clearly contains the answer, 3.85 million, but because it says "Its", that chunk may not be retrieved, and even if it is, the large model has no way to know that "Its" refers to Berlin.
This actually suggests an idea worth a paper: could we run a unified coreference resolution pass before chunking, and only then chunk? The second chunk would then read "Berlin's more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits," and the pronoun problem would disappear at retrieval time. Interested readers can try this experiment.
Principle
The usual RAG pipeline chunks the text first and then vectorizes the chunks. Jina's solution is exactly what its name suggests: Late Chunking vectorizes the entire text first and chunks it afterwards.
How does it work specifically? First, take an entire document from the knowledge base and pass it through the Embedding model, which yields an embedding vector for every token position in the document. Then, according to the chosen chunking method, take the vectors within each chunk's range and average them to obtain that chunk's embedding.
For example, "DeepSeek is a large language model developed by a subsidiary of Hangzhou Huanfang Quantitative Research Institute. It is really powerful!"
Assume the passage is tokenized as shown in the figure below:
There are 23 tokens in total. Configure the Embedding model to output the hidden state at every position (also a vector; this is the Token Embedding in the figure below). If the dimension is 1024, there will be a 1024-dimensional vector at the position of "DeepSeek", another at the position of "is", and so on. Because the whole sentence is fed into the Embedding model at once, the hidden state of the token "it" actually integrates information from the entire sentence, so the model effectively knows that "it" refers to DeepSeek. If we cut the sentence at "It is really powerful!" (positions 19-23 above), take the hidden states at those positions, and average them, the resulting vector actually expresses the meaning "DeepSeek is really powerful".
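To make the pooling step concrete, here is a minimal sketch. The model name is an assumption (any long-context embedding model with accessible hidden states works the same way), and the token span 19-24 is illustrative; in real code it would be derived from the tokenizer's offset mapping.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-zh"  # assumed long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = ("DeepSeek is a large language model developed by a subsidiary of "
        "Hangzhou Huanfang Quantitative. It is really powerful!")
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    # last_hidden_state: (1, seq_len, hidden_dim) -- one vector per token,
    # and each vector already attends to the whole sentence
    token_embeddings = model(**inputs).last_hidden_state[0]

# Mean-pool the span covering "It is really powerful!" (the article's
# positions 19-23; illustrative indices, derive real ones from offsets)
start, end = 19, 24  # end-exclusive
chunk_embedding = token_embeddings[start:end].mean(dim=0)
print(chunk_embedding.shape)  # (hidden_dim,)
```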
There are three issues involved here:
1. Why take a whole document rather than the entire knowledge base? In practice, pronouns rarely refer across documents. Late Chunking aims to solve the problem of pronouns losing their referents due to chunking, so splicing the entire knowledge base together makes no sense.
2. How is the chunking ultimately done? Jina's official code repository provides several methods: by semantics, by sentence, and by token length. A chunk's embedding is the average of the Token Embeddings within the chunk's boundary range.
3. Isn't the whole document too long for the Embedding model? Jina therefore recommends using an Embedding model that supports long inputs. If the document is still too long, split it by the model's maximum supported length (for example, 8192 tokens). Suppose the vector dimension is 1024 and the document splits into 7 macro-chunks: run each macro-chunk through the model, concatenate the 7 macro-chunks' token embeddings (up to 8192 vectors of dimension 1024 each) back into one sequence, and then compute chunk embeddings as in point 2. Several practical details arise here. First, splitting by 8192 means 8192 tokens, not characters. Second, the character index of each split point must be mapped to the corresponding token index. Finally, if the input exceeds the model's maximum length, you need to check whether the model inserts a special token at the first position and remove it. A sketch of this handling follows below.
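Here is a minimal sketch of the macro-chunking described in point 3, under the stated assumptions (non-overlapping 8192-token windows; `tokenizer` and `model` as loaded in the earlier sketch). Jina's actual implementation may differ.

```python
import torch

MAX_LEN = 8192  # the embedding model's maximum input length, in tokens

def long_doc_token_embeddings(text, tokenizer, model):
    """Token embeddings for a document longer than the model's context."""
    # add_special_tokens=False avoids a special token (e.g. [CLS]) landing
    # at position 0 of every window and corrupting the token indices
    enc = tokenizer(text, return_tensors="pt", add_special_tokens=False,
                    truncation=False)
    ids = enc["input_ids"][0]
    pieces = []
    for i in range(0, len(ids), MAX_LEN):
        window = ids[i:i + MAX_LEN].unsqueeze(0)  # (1, <=8192)
        with torch.no_grad():
            pieces.append(model(input_ids=window).last_hidden_state[0])
    # e.g. 7 windows of up to 8192 vectors, each 1024-dimensional,
    # concatenated back into one (total_tokens, hidden_dim) sequence
    return torch.cat(pieces, dim=0)
```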
Results
We used "Berlin" as the query and calculated the similarity with the following three sentences. We used two methods to vectorize these three sentences: one is the traditional method of first segmenting and then vectorizing, which is called Traditional, and the other is Late Chunking.
The first sentence has no ambiguous references, so the cosine similarities from the two methods are very close, as expected.
The second sentence contains the pronouns "Its" and "it", and the third contains "The city". For these, the traditional method yields relatively low cosine similarities, while Late Chunking yields noticeably higher ones, which reflects Late Chunking's advantage.
| Text | Similarity (Traditional) | Similarity (Late Chunking) |
| --- | --- | --- |
| Berlin is the capital and largest city of Germany, both by area and by population. | 0.84862185 | 0.849546 |
| Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. | 0.7084338 | 0.82489026 |
| The city is also one of the states of Germany, and is the third smallest state in the country in terms of area. | 0.7534553 | 0.84980094 |
What does higher similarity mean? When the knowledge base is large and the user asks "How many permanent residents are there in Berlin?", Late Chunking ranks the second sentence higher among the candidates, giving it a greater probability of being recalled, whereas the traditional method ranks it relatively low.
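For reference, a hedged sketch of how this comparison could be reproduced, reusing the `tokenizer` and `model` from the earlier sketch. The model Jina used in its blog post may differ, and the per-sentence token spans here are placeholders that should be derived from the offset mapping (see the Core code section).

```python
import torch
import torch.nn.functional as F
# tokenizer and model as loaded in the earlier sketch

def embed_alone(text):
    # Traditional: encode the text in isolation, mean-pool its own tokens
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0].mean(dim=0)

def embed_late(document, token_spans):
    # Late Chunking: encode the whole document once, then mean-pool spans
    inputs = tokenizer(document, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return [hidden[s:e].mean(dim=0) for s, e in token_spans]

document = (
    "Berlin is the capital and largest city of Germany, both by area and by "
    "population. Its more than 3.85 million inhabitants make it the European "
    "Union's most populous city, as measured by population within city limits. "
    "The city is also one of the states of Germany, and is the third smallest "
    "state in the country in terms of area."
)
sentences = document.split(". ")
# Placeholder per-sentence token spans; compute them from the offset mapping
spans = [(0, 18), (18, 45), (45, 70)]

query = embed_alone("Berlin")
for sent, late_emb in zip(sentences, embed_late(document, spans)):
    trad = F.cosine_similarity(query, embed_alone(sent), dim=0).item()
    late = F.cosine_similarity(query, late_emb, dim=0).item()
    print(f"{trad:.4f}  {late:.4f}  {sent[:40]}...")
```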
Effect
Although the motivation is tenable and the principle seems to make sense, the experimental results are poor; we analyze why below. Note that this is not a strictly controlled experiment: there are two variables, since besides the chunking method, the vector model also differs. The generation model and the evaluation model are consistent with the other experiments.
Core code
The code for this article is open source and available at: https://github.com/Steven-Luo/MasteringRAG/blob/main/split/05_late_chunking.ipynb
The core of Late Chunking is not whether chunking happens first or last, but that each chunk's vector representation integrates its context. The source code provides chunking by English sentence, by token count, etc., which does not match Chinese conventions well, so the implementation in this article still chunks by line breaks, while each chunk's embedding uses the semantic information of the entire document. A sketch of this variant follows.
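Below is a sketch of the newline-based variant, assuming a fast Hugging Face tokenizer (required for `return_offsets_mapping`) and the same hypothetical model as before; the notebook linked above is the source of truth.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-zh"  # assumed; see the notebook
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunk_by_newline(document):
    # Character span of every non-empty line
    spans, pos = [], 0
    for line in document.split("\n"):
        if line.strip():
            spans.append((pos, pos + len(line)))
        pos += len(line) + 1  # +1 for the stripped "\n"

    # The offset mapping links each token back to its character range
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, add_special_tokens=False)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]

    chunks = []
    for start, end in spans:
        # Tokens whose character range overlaps this line
        idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        if idx:
            chunks.append(hidden[idx].mean(dim=0))
    return chunks
```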
Results Analysis
Since our test set is Chinese while Jina's official code targets English, I initially suspected a bug in my implementation, but analyzing the English data revealed the same problem.
Only the most critical analysis is shown here; for more, see the analysis section of the source code: https://github.com/Steven-Luo/MasteringRAG/blob/main/split/05_late_chunking_en_debug.ipynb
For a knowledge-base fragment, if you use that very fragment as the query and keep only the Top 1 result, you should retrieve the fragment itself in most cases; with Late Chunking, however, this often fails.
The following analysis uses part of the Wikipedia entry for DeepSeek as the knowledge base, splits it into fragments with the code Jina published, and then takes each fragment in turn as a query. The query vector comes from the Embedding model directly, and similarity is computed against the Late Chunking vectors of the knowledge-base fragments. The result shows that the most similar fragment is not always the fragment itself; the first five sentences all appear most similar to the first sentence.
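A small sketch of this self-retrieval check, with hypothetical helper names: `query_embs` holds each fragment embedded in isolation, `chunk_embs` its Late Chunking embedding, in the same order.

```python
import torch
import torch.nn.functional as F

def top1_is_self(query_embs, chunk_embs):
    """For each fragment-as-query, check whether its nearest
    knowledge-base fragment is itself (inputs: lists of 1-D tensors)."""
    q = F.normalize(torch.stack(query_embs), dim=1)
    c = F.normalize(torch.stack(chunk_embs), dim=1)
    sim = q @ c.T                    # pairwise cosine similarity matrix
    top1 = sim.argmax(dim=1)         # index of the most similar fragment
    return (top1 == torch.arange(len(query_embs))).tolist()
```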
Since Late Chunking averages the hidden states of every position within a fragment, one would expect short sentences containing more pronouns to be more similar to other sentences. For simplicity, we examine sentence length here.
In the figure below, 0 on the horizontal axis means a fragment's most similar fragment is not itself, and 1 means it is. The results show that long sentences are generally most similar to themselves, while short sentences are most similar to other sentences. This is easy to understand: short sentences usually rely on the preceding content as context and therefore contain relatively more pronouns.
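A sketch of the length check itself, reusing the hypothetical `top1_is_self` helper above; `fragments` (a list of strings) and the two embedding lists are assumed to exist from the previous steps.

```python
import statistics

flags = top1_is_self(query_embs, chunk_embs)
# Average fragment length per group, mirroring the figure above
for value in (False, True):
    lengths = [len(f) for f, s in zip(fragments, flags) if s == value]
    if lengths:
        print(f"top1_is_self={value}: mean length {statistics.mean(lengths):.1f}")
```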
Judging from the full-pipeline RAG evaluation and the English analysis, this method does not appear to be a generally applicable way to improve chunking. You are welcome to try it, and if you find a bug in my code, feedback is welcome.
Previous Articles
Data preparation
Using RAG technology to build an enterprise-level document question-answering system: QA extraction
Baseline
Basic process of building an enterprise-level document question-answering system using RAG technology
Evaluation
Evaluation with TruLens
Evaluation using GPT4
Parsing optimization
Parsing (1) Convert PDF to Markdown using MinerU
Segmentation optimization
Segmentation (1) Markdown document segmentation
Segmentation (2) Using Embedding for semantic segmentation
Segmentation (3) Using Jina API for semantic segmentation
Segmentation (4) Meta Chunking
Retrieval Optimization
Retrieval Optimization (1) Embedding Fine-tuning
Retrieval Optimization (2) Multi Query
Retrieval Optimization (3) RAG Fusion
Retrieval Optimization (4) BM25 and Hybrid Search
Retrieval Optimization (5) Commonly Used Rerank Comparison
Retrieval Optimization (6) Rerank Model Fine-tuning
Retrieval Optimization (7) HyDE
New Architecture (1) LightRAG