A detailed explanation of the Reranker model for RAG retrieval enhancement

Written by
Clara Bennett
Updated on: July 11, 2025

Reranker model: a key technology to improve RAG retrieval efficiency and accuracy.

Core content:
1. The role and function of the Reranker model in RAG retrieval
2. The application of vector search technology in RAG and its limitations
3. How the Reranker model balances recall and LLM context window restrictions


What is the Reranker Model?

The Reranker model re-ranks the results returned by the initial RAG retrieval. It acts as the second retrieval stage in the RAG pipeline: after the initial retrieval step, it re-orders the retrieved document chunks so that the most relevant chunks are passed to the LLM first.


Why do we need a Reranker model?

Before answering this question, let’s take a deeper look at the underlying issue.

RAG works by performing semantic search over large collections of text documents, which can number in the billions. To achieve fast response times at this scale, we typically use vector search: text is converted into vectors, placed in a vector space, and compared to the query vector using similarity metrics such as cosine similarity.
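As a small illustration of the idea (not tied to any specific RAG framework), the snippet below ranks a few hypothetical document embeddings against a query embedding by cosine similarity; the vectors are made-up toy values, and a real system would produce them with an embedding model.

```python
# Toy cosine-similarity search: rank document vectors by similarity to a query vector.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.7, 0.2])            # embedding of the user's query (toy values)
doc_vecs = {
    "doc_a": np.array([0.1, 0.6, 0.3]),          # embeddings of indexed document chunks
    "doc_b": np.array([0.9, 0.1, 0.0]),
}

# Sort documents by cosine similarity to the query, highest first.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print([doc_id for doc_id, _ in ranked])  # ['doc_a', 'doc_b']
```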

Vector search requires embeddings: the meaning of a piece of text is compressed into a vector of fixed dimensionality (such as 768 or 1536 dimensions), and this compression inevitably loses information. As a result, even the top-ranked documents can miss key information.

If documents ranked lower down contain relevant information that would help the LLM form a better answer, that information is easily missed. What can we do? A simple approach is to increase the number of documents returned, i.e. increase the top_k value, and pass them all to the LLM.

The metric we are interested in here is recall, or “how many relevant documents did we retrieve”. It is important to note that recall measures the fraction of relevant documents that the system was able to find, regardless of the total number of documents retrieved. Therefore, in theory, perfect recall can be achieved by returning all documents.
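For clarity, recall can be computed as the fraction of all relevant documents that appear in the retrieved set; the document ids below are hypothetical.

```python
# Recall = |retrieved ∩ relevant| / |relevant|
def recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)

# One of the two relevant documents was retrieved, so recall is 0.5.
print(recall({"d1", "d2", "d3"}, {"d2", "d4"}))
```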

However, this is not feasible in practice. First, a large language model (LLM) can only accept a limited amount of input text; this limit is called the "context window". Even though models such as Anthropic's Claude offer context windows of up to 100K tokens, the amount of input text cannot be increased indefinitely. Second, when the context window is stuffed with too many tokens, the LLM's ability to recall information and follow instructions degrades. Studies have shown that overfilling the context window reduces the model's ability to retrieve information from within it, which in turn hurts the quality of the generated answers.


To resolve the tension between retrieval recall and the LLM's context window, the Reranker model provides an effective solution. The specific steps are as follows:

  1. Maximize retrieval recall

    During the initial retrieval phase, recall can be improved by increasing the number of documents returned by the vector database (i.e. increasing the top_k value). This means retrieving as many potentially relevant documents as possible, ensuring that no information that could help the LLM form a high-quality answer is missed.

  2. Re-rank and filter the most relevant documents

    In the second stage, the Reranker model re-ranks the large set of retrieved documents. The Reranker can evaluate the relevance between the query and each document more accurately, keep only the most relevant documents, and thereby reduce the number of documents finally passed to the LLM. The key parts of this step are:

  • Rerank: re-order the documents according to their relevance scores for the query.
  • Filter: keep only the most relevant documents, ensuring that they fit within the LLM's context window.
    Through these two steps, the Reranker model not only improves retrieval recall but also ensures that the documents passed to the LLM are the most relevant ones, so the LLM always works from high-quality information. This enables the LLM to generate more accurate and valuable answers while avoiding context-window overload. A minimal sketch of the rerank-and-filter step is shown below.
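The sketch assumes the sentence-transformers CrossEncoder and a hypothetical list of candidate chunks already returned by the vector database with a large top_k; the model name, the token budget, and the word-count proxy for token counting are illustrative assumptions, not fixed choices.

```python
# Hypothetical rerank-and-filter step: score candidates with a cross-encoder,
# then keep only as many top-ranked chunks as fit a rough context budget.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

def rerank_and_filter(query: str, candidates: list[str], token_budget: int = 3000) -> list[str]:
    # Score every (query, chunk) pair; higher score means more relevant.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]

    # Keep top-ranked chunks until the (rough) token budget is exhausted.
    # Word count is a crude stand-in for a real tokenizer here.
    kept, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())
        if used + cost > token_budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```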

Principle of the Reranker Model

    The re-ranking model (also known as a Cross-Encoder) takes a query-document pair and outputs a similarity score. We use this score to re-order the documents according to their relevance to the query.


    In essence, this forms a two-stage retrieval system:

    • Phase 1: Fast Retrieval (Vector DB or Bi-Encoder Retrieval): use a Bi-Encoder or a sparse embedding model to quickly extract a set of candidate documents from a large data set. The core goal of this phase is to narrow the search scope efficiently so that very large data sets can be processed in a short time. The Bi-Encoder encodes the query and the documents into vectors separately and compares them with metrics such as cosine similarity.
    • Phase 2: Accurate Reranking (Reranker / Cross-Encoder): use the reranking model to re-rank the documents extracted in the first phase. The Reranker evaluates the relevance between the query and each document more accurately, outputs similarity scores, and returns the top K most relevant documents. The goal of this phase is to improve the relevance of the retrieval results and ensure that the most relevant documents are passed to the large language model (LLM) first. (A minimal end-to-end sketch of both phases is shown after this list.)
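As a concrete illustration, here is a minimal end-to-end sketch of the two phases using the sentence-transformers library; the model names, the toy corpus, and the top_k / top_n values are assumptions made for the example, not a prescribed setup.

```python
# Two-stage retrieval sketch: bi-encoder for fast candidate retrieval,
# cross-encoder for accurate reranking of the candidates.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                    # Phase 1
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")    # Phase 2

corpus = [
    "Rerankers rescore query-document pairs with a cross-encoder.",
    "Bi-encoders embed queries and documents separately.",
    "Paris is the capital of France.",
]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query: str, top_k: int = 3, top_n: int = 1) -> list[str]:
    # Phase 1: cosine-similarity search over the embedded corpus.
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    candidates = [corpus[hit["corpus_id"]] for hit in hits]

    # Phase 2: rerank the candidates and keep only the top_n for the LLM.
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
    return reranked[:top_n]

print(search("What does a reranker do?"))
```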

    Why a two-stage strategy?

    This is because retrieving a small number of documents from a large data set is much faster than reranking a large number of documents. In short, the reranker is slow, while the retriever is fast.

    Although rerankers are slower, we still choose to use them because their accuracy far exceeds that of embedding models.

    The fundamental reasons for the Bi-Encoder's lower accuracy are:

    1. The bi-encoder encodes documents and queries separately into fixed-dimensional vectors (such as 768 or 1536 dimensions), which inevitably loses information. The rich semantics of the text are compressed into a low-dimensional vector that cannot fully retain all of the original text's potential meanings.
    2. The bi-encoder creates document embeddings before the user ever asks a query, so those embeddings know nothing about the specific content of the query. They can only capture a generalized, averaged meaning and cannot be optimized for a particular query. This static embedding approach limits performance on complex queries.

    The reranker (Cross-Encoder), by contrast, processes the raw text of the query and the document together inside a single large Transformer, avoiding the loss caused by compressing them into fixed vectors. Because it analyzes the original texts directly, all relevant information can be taken into account.
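To make the joint-processing point concrete, here is a small sketch with the Hugging Face transformers library showing that a cross-encoder receives the query and the document as one concatenated input sequence; the model name and the example texts are illustrative assumptions.

```python
# The cross-encoder sees query and document in a single forward pass,
# so attention can compare them token by token before producing one score.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

query = "What does a reranker do?"
document = "A reranker rescores candidate documents for a given query."

# The tokenizer packs both texts into one input: [CLS] query [SEP] document [SEP]
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # a single relevance score
print(score)
```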

    However, the reranker's higher accuracy comes at a cost: it needs considerably more time to generate similarity scores.


    Therefore, in practical applications, we usually combine the advantages of both and adopt a two-stage retrieval strategy:

    • Phase 1: use a bi-encoder to quickly retrieve a batch of candidate documents.
    • Phase 2: use the reranker to re-rank the candidates, ensuring that the documents finally returned are the most relevant.

    This combination preserves the speed of retrieval while improving the accuracy of the results. For example, when processing 40 million records, returning a query result could take more than 50 hours if only the reranker were used; with a bi-encoder and vector search, the initial screening can be completed in less than 100 milliseconds. The reranker then re-scores only the small set of pre-screened documents, which improves efficiency while preserving quality.