Recommendation
An in-depth look at why reranking is indispensable in RAG, and how it becomes the key to improving performance on question-answering tasks.
Core content:
1. How retrieval augmented generation (RAG) differs from traditional search
2. The impact of recall and context window in RAG
3. Reranking as a solution to improve RAG performance
Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)
Today I would like to take a deep dive into an important part of retrieval augmented generation (RAG): reranking.

RAG has attracted a great deal of attention, especially when combined with large language models (LLMs). Everyone hopes it will finally make those complex question-answering tasks easy. But the reality often falls short of expectations. After building a RAG pipeline, many developers are left wondering: why doesn't it work as well as expected?

Like most tools, RAG is easy to use but hard to master. The truth is that RAG is more than storing documents in a vector database and putting an LLM on top. That approach may work in some cases, but it does not always work.

So today we will talk about what to do when an existing RAG pipeline fails to deliver the desired results. If you often find your RAG system underperforming, here is the easiest and fastest fix to implement: reranking.

— 1 —
Recall and context window
Before we dive into the solution, let's first look at the problems in a traditional RAG (Retrieval Augmented Generation) pipeline. In RAG applications, we often need to run semantic search over massive collections of text documents, which may number anywhere from tens of thousands to tens of billions.
To keep large-scale search fast enough, we usually rely on vector search. Specifically, we convert text into vectors, place them in the same vector space, and then use a similarity measure such as cosine similarity to compare how close each document vector is to the query vector.

The key to vector search lies in the vectors themselves. They essentially compress the "meaning" behind a piece of text into a low-dimensional vector (usually 768 or 1024 dimensions). This compression inevitably loses some information.

Because of that information loss, we often find that the first few documents returned by vector search miss important relevant information, and that this relevant information sits just outside the top_k threshold we set.

So what should we do if documents at those lower positions could help the LLM (Large Language Model) generate a better answer? The most direct fix is to increase the number of documents returned (raise the top_k value) and pass all of them to the LLM.

The metric in question here is recall, which tells us "how many of the relevant documents we actually retrieved". Recall does not account for the total number of documents returned, so we could "game" the metric by returning every document and achieving perfect recall.

But we cannot return every document. LLMs have a limit on how much text they can accept, called the context window. For example, an LLM like Anthropic's Claude can have a context window of up to 100K tokens, enough to fit dozens of pages of text.

So can we improve recall by returning a large number of documents (though not all of them) and "filling up" the context window? The answer is no. Context stuffing hurts the LLM's recall performance. The LLM's recall here is different from the retrieval recall discussed above: it refers to the model's ability to find and use information placed inside its context window. Studies have shown that as we pack more tokens into the context window, the LLM's recall degrades, and the model also becomes less likely to follow instructions. So context stuffing is not a good solution.

This leaves us with a dilemma: we can improve retrieval recall by returning more documents from the vector database, but passing all of them to the LLM hurts the LLM's recall. What should we do?

The answer is to maximize retrieval recall by retrieving a large number of documents, and then maximize LLM recall by minimizing the number of documents actually passed to the LLM. To do that, we rerank the retrieved documents and keep only the most relevant ones for the LLM. The key to this step is reranking.
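To make the vocabulary above concrete, here is a minimal NumPy sketch of cosine-similarity top_k retrieval and how retrieval recall is measured against it. The corpus, query, and relevance labels are entirely synthetic and only meant to illustrate the mechanics, not to benchmark anything.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 1,000 random 768-dimensional "embeddings" plus one query vector.
doc_vecs = rng.normal(size=(1000, 768))
query_vec = rng.normal(size=768)

# Synthetic ground truth: make three documents genuinely related to the query
# by mixing a bit of the query vector into them.
relevant_ids = [3, 42, 977]
for i in relevant_ids:
    doc_vecs[i] = 0.3 * query_vec + rng.normal(size=768)

def cosine_top_k(query, docs, k):
    """Indices of the k documents with the highest cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]

def retrieval_recall(retrieved, relevant):
    """Fraction of the relevant documents that made it into the retrieved set."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

for k in (3, 10, 50):
    hits = cosine_top_k(query_vec, doc_vecs, k)
    print(f"top_k={k:>3}  recall={retrieval_recall(hits, relevant_ids):.2f}")
```

The same recall calculation applies to a real vector database: the only change is that the top_k indices come from the database's similarity search instead of a brute-force dot product.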
— 2 —
The Power of Rerank
The reranking model, also known as a cross-encoder, is a special kind of model. It takes a query-document pair as input and outputs a score representing their similarity. We use this score to rerank the documents by their relevance to the query.
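As a quick illustration, here is a minimal sketch of cross-encoder scoring using the sentence-transformers library; the checkpoint name cross-encoder/ms-marco-MiniLM-L-6-v2 is one publicly available option, and the query and documents are made up for the example.

```python
from sentence_transformers import CrossEncoder

# Load a small, publicly available cross-encoder checkpoint.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does reranking improve RAG answers?"
docs = [
    "Reranking reorders retrieved passages by their relevance to the query.",
    "A vector database stores dense embeddings for fast similarity search.",
    "The weather in Paris is mild in spring.",
]

# Each (query, document) pair is scored jointly by the cross-encoder.
scores = reranker.predict([(query, d) for d in docs])

# Sort documents from most to least relevant.
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

Because the query and the document pass through the model together, the reranker sees the full interaction between them, which is exactly what a single precomputed embedding cannot capture.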
Modern retrieval systems usually adopt a two-stage retrieval strategy. In the first stage, a dual encoder (bi-encoder) or sparse embedding model quickly retrieves a set of relevant documents from a large-scale dataset; these models efficiently screen out a preliminary set of candidates from massive data. In the second stage, the reranking model comes into play: it reorders the documents retrieved in the first stage in a more fine-grained way, using heavier computation to assess relevance more accurately and so improve the quality of the final results.

Search engineers have been using reranking in two-stage retrieval systems for a long time. The core insight is that retrieving a small number of documents from a large dataset is much faster than reranking a large number of documents. In short, the reranker is relatively slow, while the retriever is very fast.

The main reasons for using two-stage retrieval are as follows (a sketch of the full pipeline follows this list):

First, balancing efficiency and accuracy. In the retrieval stage, dual encoders or sparse embedding models can quickly pull a set of relevant documents from a large-scale dataset. They are designed to screen candidates efficiently; their accuracy may be limited, but they are very fast. In the reranking stage, the reranker reorders those candidates. It runs more slowly, but it assesses document relevance more accurately, improving the quality of the final results.

Second, resource optimization. Reranking the entire dataset directly would be computationally very expensive and inefficient. With two-stage retrieval, the first stage quickly filters down to a small set of documents and the second stage ranks only those, which significantly reduces the consumption of computing resources.

Third, flexibility and scalability. A two-stage system allows different models and techniques at each stage: for example, an efficient vector retrieval model in the first stage and a more complex cross-encoder in the second. This staged design keeps the system flexible and easy to optimize and scale for specific needs.

The role of the reranking model. The reranker plays a vital role in a two-stage system: it reorders the first-stage results so that the documents ultimately returned are not only few in number but also more relevant. Specifically, reranking:

Improves relevance: through heavier computation, the reranker evaluates query-document relevance more accurately, improving the quality of the final results.

Optimizes the context window: since the LLM's context window is limited, reranking helps us select the most relevant documents to place in it, maximizing the LLM's performance.

Reduces noise: reranking drops documents that were retrieved in the first stage but are actually less relevant, reducing noise and improving the overall performance of the system.
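Here is a minimal sketch of the two-stage pipeline described above, again using sentence-transformers; the model names are example checkpoints, the toy corpus and query are invented, and a real system would swap the brute-force semantic_search call for a vector database.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")             # fast first-stage retriever
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # slower, more accurate reranker

corpus = [
    "Reranking reorders retrieved passages by their relevance to the query.",
    "A vector database stores dense embeddings for fast similarity search.",
    "Cross-encoders score a query and a document jointly.",
    "Paris is the capital of France.",
]
# In practice the corpus embeddings are computed once, offline.
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

def two_stage_search(query: str, first_stage_k: int = 3, final_k: int = 1):
    # Stage 1: cheap vector search over the whole corpus.
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=first_stage_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: rerank only the small candidate set with the cross-encoder.
    scores = reranker.predict([(query, c) for c in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:final_k]]

print(two_stage_search("How does a reranker improve retrieval quality?"))
```

The design choice to expose first_stage_k and final_k separately mirrors the recall trade-off from the previous section: retrieve generously, then hand the LLM only the few documents that survive reranking.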
Why use Rerank?
If reranking is so slow, why do we use it? The answer is simple: the reranker is much more accurate than the embedding model.
First, the limitations of dual encoders. The dual encoder's lower accuracy has two main causes.

1. Loss caused by information compression. The dual encoder has to compress all possible meanings of a document into a single vector. This compression inevitably loses information, because the vector's dimensionality is fixed while a document's meaning is multi-faceted. A document may cover several topics and many details, yet the dual encoder can only squeeze it into one fixed-size vector, so some information is lost.

2. Lack of query context. When the dual encoder processes a document, it has no information about the query, because the document's embedding is created before the user ever asks anything. The encoder therefore cannot adjust a document's representation for a specific query. The same document may have different relevance under different queries, but the dual encoder cannot dynamically adapt its vector to each one.

Second, the advantages of reranking. In contrast, the reranking model has clear advantages.

1. It processes the raw text directly. The reranker works on the original document and query text rather than on compressed vectors, so less information is lost and the relevance of a document to a query can be assessed more accurately. For example, the reranker can weigh specific sentences and paragraphs in a document instead of relying on a single fixed vector representation.

2. It analyzes document meaning dynamically. Because the reranker runs at query time, it can interpret a document in light of the specific query rather than producing one general, averaged meaning. For a document covering several topics, the reranker can focus on the parts most relevant to the user's question, improving the accuracy of the relevance assessment.

Third, the cost of reranking. The reranker's accuracy advantage comes with an obvious cost: time. The dual encoder compresses the meaning of a document or query into a single vector, and at query time it processes the query exactly as it processes a document, so only one encoding is needed. With, say, 40 million records, an encoder model plus vector search can complete the operation in under 100 milliseconds. The reranking model, by contrast, must process the raw documents and query together, which is far more computationally expensive and therefore much slower: with a small reranking model such as BERT on a V100 GPU, scoring one query against all the records could take more than 50 hours.
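The gap comes from where the work happens: the bi-encoder's document vectors are computed once, offline, while a cross-encoder must run a full forward pass per (query, document) pair at query time. A rough back-of-envelope calculation, using an assumed and purely illustrative throughput figure, shows how the hours add up:

```python
# Back-of-envelope cost of reranking an entire corpus (illustrative, not a benchmark).
num_docs = 40_000_000

# Bi-encoder path: document embeddings already exist; at query time we encode one
# query and run an approximate nearest-neighbour search, typically tens of ms.

# Cross-encoder path: every (query, document) pair needs a transformer forward pass.
# Assume roughly 200 pairs/second on a single GPU for a small BERT-style model
# (an assumed figure chosen only to make the arithmetic concrete).
pairs_per_second = 200

hours = num_docs / pairs_per_second / 3600
print(f"Reranking all {num_docs:,} documents for one query: ~{hours:.0f} hours")
```

This is exactly why the cross-encoder is applied only to the few dozen candidates that survive the first stage, never to the full corpus.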
Summary

Although the reranking model runs slowly, its accuracy advantage makes it indispensable in many scenarios. With a two-stage retrieval system, we can quickly screen candidate documents in the first stage and then use the reranker to refine the ranking in the second, significantly improving the quality of the retrieval results while keeping the system efficient. This strategy matters most for complex question-answering and generation tasks, because it ensures that the documents ultimately handed to the LLM are both few in number and highly relevant.