The “unsung hero” of the RAG system: how does the reranker improve the accuracy of information retrieval?

Written by
Jasper Cole
Updated on: July 1, 2025

Explore how the reranker in a RAG system improves the accuracy of information retrieval, and gain a deeper understanding of its key role in sifting high-quality information out of massive data.

Core content:
1. The definition and role of the reranker in the RAG system
2. Using the reranker to reduce "hallucination" and save costs
3. How the reranker compensates for the limitations of embedding vectors and improves retrieval quality

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

In the era of information explosion, we wade through massive amounts of data every day, trying to find truly valuable information. The emergence of the Retrieval-Augmented Generation (RAG) system is like a beacon, lighting the way through the ocean of information. But have you ever wondered why a RAG system can serve up useful information so accurately? Behind it stands a key "gatekeeper": the reranker. Today, let us explore the world of the reranker in depth and see how it plays an indispensable role in the RAG system.

1. What is the reranker in RAG?

Imagine that when you enter a keyword into a search engine, the system instantly retrieves thousands of related documents from a huge database. The quality of these results varies widely: some are highly relevant to your needs, while others are completely off-topic. This is where the reranker comes in.

The reranker is like a strict quality inspector: it performs a second round of screening and sorting on the initially retrieved documents. In the first retrieval stage, the system quickly finds a batch of candidate documents through semantic or keyword search. These documents are like unscreened raw materials and may contain plenty of irrelevant information. The reranker's job is to select the documents that best match the user's query intent and move them to the front, thereby improving the quality of the search results.

To give a simple example, suppose you are writing a paper on "the application of artificial intelligence in the medical field" and search for materials through a RAG system. The initial retrieval may turn up dozens of papers, news reports, and research reports covering medical imaging diagnosis, disease prediction, drug development, and many other topics. But some of them may merely mention "artificial intelligence" and "medicine" in passing, without exploring the relationship between them in any depth. The reranker carefully analyzes these documents and moves to the front those that genuinely discuss concrete applications, technical principles, and effect evaluations of AI in medicine, so that you can find the most helpful material faster.
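The retrieve-then-rerank flow described above can be sketched in a few lines of Python. This is a toy illustration only: the keyword-overlap retriever and the phrase-counting `fine_score` below are hypothetical stand-ins for a real vector store and a trained reranking model.

```python
def retrieve(query, corpus, k=4):
    """First stage: cheap keyword-overlap retrieval over the whole corpus."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def rerank(query, candidates, top_n=2):
    """Second stage: a finer (here, toy) scorer reorders only the candidates.
    A real reranker would call a cross-encoder instead of counting phrase hits."""
    def fine_score(doc):
        # Reward documents that keep the query terms close together.
        return doc.lower().count("ai in medical") + len(
            set(query.lower().split()) & set(doc.lower().split())
        )
    return sorted(candidates, key=fine_score, reverse=True)[:top_n]

corpus = [
    "AI in medical imaging: case studies and evaluation of diagnostic accuracy.",
    "A news brief that mentions AI and medical topics only in passing.",
    "Deep dive on AI in medical drug discovery pipelines.",
    "Sports results from the weekend league.",
]
query = "applications of AI in medical diagnosis"
candidates = retrieve(query, corpus)
print(rerank(query, candidates))
```

The point is the shape of the pipeline: a cheap first pass over the whole corpus, then a more careful second pass over a handful of candidates.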

2. Why use a reranker in RAG?

1. Reduce the phenomenon of “hallucination”

In RAG systems there is a common problem called "hallucination": put simply, the system generates answers that are factually wrong or meaningless. This usually happens because the retrieved documents contain a lot of irrelevant information, which misleads the model during generation. The reranker can filter out these irrelevant documents, like picking the bad parts out of ingredients, thereby reducing hallucinations.

2. Cost savings

You might think that since the RAG system retrieves documents so quickly, retrieving a few more is no big deal. In fact, processing those documents consumes real computing resources and API fees. If the reranker can accurately filter out the most relevant documents, the amount of text the system must process shrinks, and so does the cost. It is like shopping: if you can go straight to the product you want, you waste no time or money browsing a mass of irrelevant goods.
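A back-of-envelope calculation makes the saving concrete. All the numbers below (document counts, token lengths, prices) are illustrative assumptions, not measured figures:

```python
# Back-of-envelope estimate of how reranking shrinks the LLM's input.
docs_retrieved = 50         # candidates from the first-stage retriever
docs_after_rerank = 5       # candidates actually sent to the LLM
tokens_per_doc = 400        # assumed average document-chunk length
price_per_1k_tokens = 0.01  # assumed input price, USD

def prompt_cost(n_docs):
    """Cost of stuffing n_docs chunks into the generation prompt."""
    return n_docs * tokens_per_doc / 1000 * price_per_1k_tokens

print(f"without reranking: ${prompt_cost(docs_retrieved):.2f} per query")
print(f"with reranking:    ${prompt_cost(docs_after_rerank):.2f} per query")
```

Under these assumptions, reranking cuts the generation-side input cost tenfold; the exact ratio will differ per deployment, but the direction does not.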

3. Make up for the limitations of embedding vectors

In the RAG system, embedding is a commonly used information representation method. It maps documents and queries into a low-dimensional vector space and determines the relevance of a document to a query by computing the similarity between their vectors. However, this method has limitations. First, an embedding vector may not accurately capture subtle semantic differences. For example, "I like to eat apples" and "I like to eat apple pie" are semantically related but clearly different, and an embedding may not distinguish them well. Second, compressing complex information into a low-dimensional space can lose information. Finally, embeddings may generalize poorly on information outside their training distribution. The reranker makes up for these shortcomings by using more sophisticated matching techniques to analyze and sort documents in finer detail.
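A toy example makes the information-loss point tangible. A bag-of-words count vector is a far cruder representation than a learned embedding, but it exhibits the same failure mode in extreme form: compressing a text into a single order-blind vector can erase meaning entirely.

```python
from collections import Counter

def bow_vector(text):
    """Toy 'embedding': a bag-of-words count vector (ignores word order)."""
    return Counter(text.lower().split())

a = bow_vector("man bites dog")
b = bow_vector("dog bites man")
print(a == b)  # True: opposite meanings, identical vectors
```

A reranker that examines the pair of texts directly, rather than two pre-compressed vectors, does not suffer this particular collapse.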

3. Advantages of the Reranker

1. Beyond Single-Vector Embeddings

Unlike an embedding model, the reranker does not simply map the entire document to a single vector. Instead, it breaks the document down into smaller units that carry contextual information, such as sentences or phrases, and can therefore understand the document's semantics more accurately. Just as when reading an article we do not draw conclusions from the title alone, but read every paragraph and sentence to grasp its main idea.

2. Semantic Keyword Matching

The reranker combines powerful encoder models (such as BERT) with keyword-based techniques, capturing the semantic meaning of documents while still attending to keyword relevance. This is like looking for a book: we check not only the title and synopsis but also the keyword index to judge more accurately whether the book meets our needs.

3. Better generalization

Because the reranker focuses on small units and contextual information within the document, it performs much better on documents and queries it has never seen before, like an experienced detective who can infer the truth of a case from subtle clues at the scene, even without having encountered a similar case.

4. Types of Rerankers

The world of rerankers is rich and varied, with new techniques and methods emerging all the time. Let's look at several common types.

1. Cross-Encoder

Cross-encoders are deep learning models that take a query-document pair as a single input and analyze it jointly, gaining a deep understanding of the relationship between the two. This is like a professional translator who not only understands the literal meaning of two languages but also accurately grasps the semantic connection between them. Cross-encoders excel at accurate relevance scoring, but they demand substantial computing resources and are not well suited to latency-sensitive, real-time applications.
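The defining trait of a cross-encoder is that the query and document are scored together, rather than compared after independent encoding. The toy scorers below illustrate only that structural difference; a real cross-encoder is a trained model (such as a fine-tuned BERT), not a hand-written heuristic.

```python
from collections import Counter

def encode(text):
    # Toy independent "embedding": bag-of-words counts (order-blind).
    return Counter(text.lower().split())

def bi_encoder_score(query, doc):
    # Bi-encoder style: each side is compressed to a vector on its own,
    # then compared; the model never sees the pair side by side.
    q, d = encode(query), encode(doc)
    return sum(q[t] * d[t] for t in q.keys() & d.keys())

def cross_encoder_score(query, doc):
    # Cross-encoder style (toy): the pair is scored jointly, so interactions
    # between query and document words are visible. A real cross-encoder
    # learns this scoring function instead of hard-coding it.
    q_terms = query.lower().split()
    d_words = doc.lower().split()
    term_hits = sum(term in d_words for term in q_terms)
    # Bonus when consecutive query words appear together in the document.
    phrase_hits = sum(
        f"{q_terms[i]} {q_terms[i + 1]}" in doc.lower()
        for i in range(len(q_terms) - 1)
    )
    return term_hits + phrase_hits

query = "apple pie recipe"
doc_a = "recipe for apple pie"
doc_b = "apple orchard and pie chart recipe book"
print(bi_encoder_score(query, doc_a), bi_encoder_score(query, doc_b))    # tie
print(cross_encoder_score(query, doc_a), cross_encoder_score(query, doc_b))
```

The independent encodings tie the two documents, while joint scoring separates them, which is precisely the extra signal a cross-encoder buys at extra compute cost.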

For example, we can use FlashrankRerank as the reranker, combined with ContextualCompressionRetriever, to improve the relevance of the retrieved documents. The code is as follows:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

# `retriever` and `pretty_print_docs` are assumed to be defined earlier.
llm = ChatOpenAI(temperature=0)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)

This code uses FlashrankRerank to rerank the documents returned by the base retriever, sorting them by relevance to the query "What did the president say about Ketanji Jackson Brown". Finally, it prints the document IDs and the compressed, reranked document contents.

2. Multi-Vector Reranker

Multi-vector models, such as ColBERT, take a late-interaction approach: the representations of the query and the document are encoded independently, and their interaction happens late in the pipeline. This allows document representations to be precomputed, which speeds up retrieval and reduces computational requirements.
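The late-interaction ("MaxSim") scoring at the heart of ColBERT can be shown in a few lines: each query-token vector is matched against its best document-token vector, and the per-token maxima are summed. The tiny two-dimensional vectors here are hand-made stand-ins for what a real model would produce per token.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query token picks its
    best-matching document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hand-made token embeddings (a real model produces one vector per token;
# document vectors can be precomputed and stored ahead of query time).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]               # two query tokens
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]     # three document tokens
print(maxsim_score(query_vecs, doc_vecs))
```

Because only the cheap MaxSim aggregation happens at query time, the expensive per-token encoding of documents is paid once, offline.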

The code to use the ColBERT reranker is as follows:

# pip install -U ragatouille
from ragatouille import RAGPretrainedModel
from langchain.retrievers import ContextualCompressionRetriever

# `retriever` is assumed to be defined earlier.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What animation studio did Miyazaki found"
)
print(compressed_docs[0])

This code compresses and reranks the retrieved documents with the ColBERT reranker to answer the question "Which animation studio did Hayao Miyazaki found?" The output contains information about Miyazaki founding Studio Ghibli, along with the studio's background and first film.

3. Fine-tuned LLM Reranker

Fine-tuning Large Language Models (LLMs) is key to improving their performance on reranking tasks. Pre-trained LLMs are not inherently good at measuring the relevance between queries and documents. By fine-tuning them on ranking datasets such as the MS MARCO passage ranking dataset, we can enhance their document-ranking performance.

Depending on the model structure, there are two main types of supervised rerankers:

  1. Encoder-Decoder Models: These models treat document ranking as a generative task and use an encoder-decoder framework to optimize the re-ranking process. For example, the RankT5 model is trained to generate tokens to classify query-document pairs as relevant or irrelevant.
  2. Decoder-only models: This approach focuses on fine-tuning decoder-only models such as LLaMA. Models such as RankZephyr and RankGPT explore different ways of computing relevance in this setting.

By applying these fine-tuning techniques, we can improve the performance of LLMs in the re-ranking task, making them more effective in understanding and prioritizing relevant documents.

The code to use RankZephyr is as follows:

# pip install --upgrade --quiet rank_llm
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_community.document_compressors.rankllm_rerank import RankLLMRerank

# `retriever`, `query`, and `pretty_print_docs` are assumed to be defined earlier.
compressor = RankLLMRerank(top_n=3, model="zephyr")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(query)
pretty_print_docs(compressed_docs)

This code uses RankZephyr to rerank the retrieved documents and selects the three most relevant to the query. The output contains information related to the query, such as imposing economic sanctions on Russia and closing US airspace to Russian flights.

4. Using an LLM as a "Judge" for Reranking

Large language models can improve document reranking on their own through prompting strategies such as pointwise, listwise, and pairwise methods. These methods leverage the reasoning power of LLMs (using the LLM as a "judge") to assess the relevance of documents to queries directly. While competitive in effectiveness, the high computational cost and latency of LLMs may hinder practical deployment.

  1. Pointwise methods: These evaluate the relevance of a single document to the query. They come in two flavors, relevance generation and query generation, and both support zero-shot reranking, that is, ranking documents without prior training on task-specific examples.
  2. Listwise methods: These rank a list of documents by putting the query and the document list into a single prompt and instructing the LLM to output the identifiers of the documents in reranked order. Because the LLM's input length is limited, it is usually impossible to include all candidates at once, so listwise methods adopt a sliding-window strategy: they rank a subset of documents at a time, moving the window from the back of the list to the front and reranking only the documents inside the current window.
  3. Pairwise methods: The LLM receives a prompt containing the query and a pair of documents, and its task is to decide which document is more relevant. To aggregate results, methods such as AllPairs generate every possible document pair and compute a final relevance score for each document; sorting-based strategies such as heap sort and bubble sort can reduce the number of comparisons and speed up ranking.

The code for pointwise, listwise, and pairwise reranking using OpenAI's GPT-4-turbo model is as follows:

import openai

# Set your OpenAI API key (this sketch assumes the openai v1 Python SDK)
client = openai.OpenAI(api_key="YOUR_API_KEY")

def chat(prompt):
    """Send a single-turn prompt to the model and return the text reply."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def pointwise_rerank(query, document):
    # Score one query-document pair on a 1-10 relevance scale.
    prompt = (
        f"Rate the relevance of the following document to the query "
        f"on a scale from 1 to 10:\n\n"
        f"Query: {query}\nDocument: {document}\n\nRelevance Score:"
    )
    return chat(prompt)

def listwise_rerank(query, documents):
    # Rank documents one window at a time to stay within the context limit.
    window_size = 5
    reranked_docs = []
    for i in range(0, len(documents), window_size):
        window = documents[i:i + window_size]
        prompt = (
            f"Given the query, please rank the following documents:\n\n"
            f"Query: {query}\nDocuments: {', '.join(window)}\n\n"
            f"Ranked Document Identifiers:"
        )
        reranked_docs.extend(chat(prompt).split(', '))
    return reranked_docs

def pairwise_rerank(query, documents):
    # Compare every pair of documents and tally wins (the AllPairs strategy).
    scores = {doc: 0 for doc in documents}
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            doc1, doc2 = documents[i], documents[j]
            prompt = (
                f"Which document is more relevant to the query?\n\n"
                f"Query: {query}\nDocument 1: {doc1}\nDocument 2: {doc2}\n\n"
                f"Answer with '1' for Document 1, '2' for Document 2:"
            )
            winner = chat(prompt)
            if winner == '1':
                scores[doc1] += 1
            elif winner == '2':
                scores[doc2] += 1
    # Sort documents by number of pairwise wins, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, score in ranked]

# Example usage
query = "What are the benefits of using LLMs for document reranking?"
documents = [
    "LLMs can process large amounts of text quickly.",
    "They require extensive fine-tuning for specific tasks.",
    "LLMs can generate human-like text responses.",
    "They are limited by their training data and may produce biased results.",
]

# Pointwise reranking
for doc in documents:
    score = pointwise_rerank(query, doc)
    print(f"Document: {doc} - Relevance Score: {score}")

# Listwise reranking
print(f"Listwise Reranked Documents: {listwise_rerank(query, documents)}")

# Pairwise reranking
print(f"Pairwise Reranked Documents: {pairwise_rerank(query, documents)}")

This code reranks a set of documents with the pointwise, listwise, and pairwise methods in turn. The output shows each document's relevance score and the document order produced by each method.

5. Reranking APIs

Private reranking APIs offer a convenient solution for organizations that want to enhance the semantic relevance of their search systems without making a large infrastructure investment. Companies like Cohere, Jina, and Mixedbread offer these services.

Cohere: Provides custom models for English and multilingual documents, automatically chunks documents, and normalizes relevance scores between 0 and 1.

Jina: Focuses on enhancing search results through semantic understanding and providing longer context length.

Mixedbread: Provides a set of open source reranking models that provide flexibility for integration into existing search infrastructure.

The code using Cohere is as follows:

# pip install --upgrade --quiet cohere
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere
from langchain.chains import RetrievalQA

# `retriever` is assumed to be defined earlier.
llm = Cohere(temperature=0)
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=compression_retriever
)

This code reranks the retrieved documents through Cohere's reranking model and wires the result into a RetrievalQA chain. Invoking the chain with the query about Ketanji Brown Jackson returns an answer that includes the president's high praise for her and the support she has received.

5. How to choose a suitable reranker for RAG?

Choosing the best reranker for your RAG system requires weighing several factors:

1. Improved relevance

The main goal of a reranker is to improve the relevance of search results. Metrics such as Normalized Discounted Cumulative Gain (NDCG) or attribution can be used to evaluate a reranker's effect on relevance. Just as a chef's skill is judged by whether the dishes suit the customers' taste, we need to verify that the reranker actually improves the quality of the search results.
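NDCG itself is easy to compute. Below is a minimal sketch, using hypothetical graded relevance labels (3 = highly relevant, 0 = irrelevant) for a result list before and after reranking:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by log rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG = DCG of the system's ranking / DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Relevance labels of results in the order each stage returned them.
before = [1, 3, 0, 2]   # hypothetical order straight from the retriever
after = [3, 2, 1, 0]    # hypothetical order after reranking (here: ideal)
print(f"NDCG before reranking: {ndcg(before):.3f}")
print(f"NDCG after reranking:  {ndcg(after):.3f}")
```

A reranker is earning its keep when the NDCG of its output consistently exceeds that of the raw retrieval order on a held-out query set.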

2. Latency

Latency refers to the extra time the reranker adds to the search pipeline. Make sure it stays within the bounds your application can tolerate: a reranker that improves relevance but takes too long may be unsuitable for real-time scenarios, just as in a tense game a tool that helps too slowly is not the best choice.
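Measuring that added latency is straightforward: time only the reranking step, averaged over a few runs. The `toy_rerank` stand-in below is hypothetical; in practice you would pass your actual reranker call.

```python
import time

def measure_added_latency(rerank_fn, query, candidates, runs=5):
    """Average wall-clock time of the reranking step alone, in seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        rerank_fn(query, candidates)
    return (time.perf_counter() - start) / runs

# Stand-in reranker (a real one would call a model; this just sorts by length).
def toy_rerank(query, candidates):
    return sorted(candidates, key=len)

latency = measure_added_latency(toy_rerank, "q", ["doc one", "a", "longer document"])
print(f"average reranking latency: {latency * 1000:.3f} ms")
```

Comparing this number against your end-to-end latency budget tells you directly whether a candidate reranker is viable for your application.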

3. Contextual understanding

Consider the reranker's ability to handle contexts of different lengths. Complex queries may require longer context, and some rerankers handle this better, much as some readers can follow long, difficult sentences and complex contextual relationships while others manage only simple sentences.

4. Generalization

Make sure the reranker performs well across different domains and datasets to avoid overfitting. A reranker that shines on one specific domain or dataset but fails elsewhere is not reliable, just as a student who excels in a single subject but performs poorly in the rest is not well-rounded.

6. Latest Research Progress

1. Cross-encoders come out ahead

Recent studies show that cross-encoders are both efficient and effective when paired with a strong retriever. While the performance difference may be small in-domain, the reranker's impact is much more pronounced in out-of-domain scenarios. In reranking tasks, cross-encoders generally outperform most LLMs (except GPT-4 in some cases) while being more efficient.

7. Conclusion

Selecting the right reranker is critical to improving system performance and ensuring accurate search results. As the RAG field grows, a clear understanding of the whole pipeline is essential for teams building effective systems. Understanding the different types of rerankers and their trade-offs, then carefully selecting and evaluating one for your RAG application, enhances both accuracy and efficiency. This thoughtful approach leads to better results and a more reliable system.

In the world of RAG, the reranker is like an unsung hero. It does not interact with users directly, but behind the scenes it delivers accurate and efficient information services. I hope that, with a deeper understanding of the reranker, we can make better use of RAG systems, swim unimpeded in the ocean of information, and find the treasures of knowledge that are truly valuable.