Understanding RAG Part III: Fusion Retrieval and Reranking

An in-depth look at an upgraded version of the RAG system: fusion retrieval, which improves response quality and contextual coherence.
Core content:
1. Fusion retrieval explained: how aggregating multiple information streams enhances a RAG system
2. Classic RAG vs. fusion retrieval: how each approach handles multi-document integration
3. Reranking: a mechanism that improves document relevance to optimize responses to user queries
We have previously introduced what RAG is, its importance in large language models (LLMs), and what classic retriever and generator systems for RAG look like. The third article in this series explores an upgraded approach to building RAG systems: fusion retrieval.
Before we dive in, it's worth briefly reviewing the basic RAG scenario we explored in Part 2 of this series.
Understanding fusion retrieval
Fusion retrieval methods fuse, or aggregate, multiple information streams during the retrieval phase of a retrieval augmented generation (RAG) system. To recap, in the retrieval phase the retriever (an information retrieval engine) takes the user's original query destined for the large language model (LLM), encodes it into a numerical vector representation, and uses that vector to search a large knowledge base for documents that strongly match the query. The original query is then augmented with contextual information from the retrieved documents, and the augmented input is finally sent to the LLM to generate a response.
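To make that recap concrete, here is a minimal sketch of the retrieval step, assuming the sentence-transformers library and a toy in-memory knowledge base; the model name and the sample documents are illustrative choices, not part of the article's setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works here; this one is a common lightweight choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy knowledge base, pre-encoded into vectors.
documents = [
    "Hokkaido's national parks offer volcanic landscapes and hot springs.",
    "Tokyo is famous for its food scene and nightlife.",
    "Jiuzhaigou Valley features turquoise lakes, waterfalls, and hiking trails.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Encode the query and return the k most similar documents with their scores."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity, since the vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [(documents[i], float(scores[i])) for i in top]

print(retrieve("best destinations for nature lovers in Asia"))
```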
By applying a fusion scheme at the retrieval stage, more coherent and contextually rich background information can be added on top of the original query, further improving the final response generated by the LLM. Fusion retrieval leverages the knowledge contained in multiple retrieved documents (search results) and combines it into a more meaningful and accurate context. However, the classic RAG scheme we are already familiar with can also retrieve multiple documents from the knowledge base, not just one. So what is the difference between the two approaches?
The key difference between classic RAG and fusion retrieval lies in how multiple retrieved documents are processed and integrated to form the final response. In classic RAG, the retrieved documents' contents are simply concatenated, or at most summarized extractively, and then fed to the LLM as additional context for generating the response; no advanced fusion techniques are applied. In fusion retrieval, more specialized mechanisms are used to combine relevant information across multiple documents. This fusion can occur at the augmentation stage (right after retrieval) or even at the generation stage.
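As a point of reference, classic RAG in the sense described above can be sketched in a few lines: the retrieved documents are concatenated into the prompt as-is, with no fusion logic. `retrieve` is the sketch from the previous snippet, and `llm` is an assumed stand-in for any text-generation call, not a specific API.

```python
def classic_rag(query: str, llm) -> str:
    """Plain RAG: concatenate retrieved documents into the prompt, no fusion."""
    docs = [doc for doc, _ in retrieve(query, k=3)]
    context = "\n\n".join(docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```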
- Fusion in the augmentation phase
This includes techniques that re-rank, filter, or merge multiple documents before passing them to the generator. Two examples are re-ranking and aggregation: re-ranking scores and orders documents by relevance before feeding them to the model along with the user prompt, while aggregation merges the most relevant parts of each document into a single context. Aggregation can be achieved with classic information retrieval methods such as TF-IDF (term frequency-inverse document frequency), embedding operations, and the like; see the first sketch after this list.
- Fusion in the generation phase
This involves the LLM (generator) processing each retrieved document independently, together with the user prompt, and fusing the information from these separate passes when generating the final response. Broadly speaking, the augmentation stage of RAG becomes part of the generation stage. A common approach in this category is Fusion-in-Decoder (FiD), which lets the LLM process each retrieved document separately and then combine their insights when producing the final answer; see the second sketch after this list.
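Below is a rough sketch of augmentation-phase fusion, using TF-IDF for re-ranking and a simple sentence-level merge for aggregation. The scoring and merging choices are illustrative assumptions, not a canonical recipe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fuse_context(query: str, docs: list[str], top_n: int = 2) -> str:
    """Re-rank docs against the query with TF-IDF, then merge the best parts."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([query] + docs)
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    # Re-ranking: order documents by TF-IDF similarity to the query.
    ranked = [doc for _, doc in sorted(zip(scores, docs), reverse=True)]
    # Aggregation: keep the leading sentence of each top-ranked document.
    return " ".join(doc.split(". ")[0].rstrip(".") + "." for doc in ranked[:top_n])
```

And a simplified generation-phase sketch in the spirit of FiD: each document is processed independently with the query, and the partial outputs are fused in a final pass. Note that a real FiD model fuses encoder outputs inside the decoder rather than via a second prompt; `llm` is again an assumed stand-in callable.

```python
def fid_style_answer(query: str, docs: list[str], llm) -> str:
    """Process each document independently, then fuse the partial answers."""
    partials = [llm(f"Context: {doc}\nQuestion: {query}") for doc in docs]
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(f"Combine these partial answers into one response:\n{joined}")
```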
Reranking is one of the simplest yet most effective fusion methods for meaningfully combining information from multiple retrieval sources. The following section briefly explains how it works.
How reranking works
During reranking, the initial set of documents returned by the retriever is reordered to improve relevance to the user's query, thereby better meeting the user's needs and raising the overall output quality. The retriever passes the retrieved documents to an algorithmic component called a "ranker", which re-evaluates the results against criteria such as learned user preferences and orders the documents so as to maximize their relevance to the specific user. Scoring mechanisms such as weighted averages are then used to combine and prioritize the highest-ranked documents, making the content of top-ranked documents more likely to end up in the final merged context than that of lower-ranked ones.
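Here is a minimal sketch of such a scoring mechanism, reusing the (document, score) pairs from the earlier `retrieve` sketch: the final rank is a weighted average of retrieval similarity and a user-preference score. The 0.7/0.3 weights and the keyword-based preference function are illustrative assumptions; a production ranker would typically learn these signals from interaction data.

```python
def rerank(results: list[tuple[str, float]], preference,
           w_sim: float = 0.7, w_pref: float = 0.3) -> list[str]:
    """Reorder (document, similarity) pairs by a weighted-average score."""
    scored = [
        (w_sim * sim + w_pref * preference(doc), doc)
        for doc, sim in results
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]

# Hypothetical preference signal: this traveler favors parks and hiking.
def prefers_nature(doc: str) -> float:
    return 1.0 if ("park" in doc.lower() or "hiking" in doc.lower()) else 0.0

ranked = rerank(retrieve("best destinations for nature lovers in Asia", k=3),
                prefers_nature)
```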
The following diagram shows how the reranking mechanism works:
To better understand re-ranking, consider an example from East Asian travel. Imagine a traveler asks a RAG system for the "best destinations for nature lovers in Asia." The initial retrieval might return a mixed collection of documents: general travel guides, articles about popular Asian cities, and recommendations for nature parks. A re-ranking model, however, can leverage additional traveler-specific preferences and contextual data (e.g., preferred activity types or previously visited destinations) to reorder these documents so that the most relevant content for that user comes first. It might surface tranquil national parks, lesser-known hiking trails, and eco-friendly travel routes that would not appear at the top of a generic recommendation list. In this way, it delivers "straight to the point" results for a nature-loving traveler like the target user.
In summary, reranking reorders multiple retrieved documents according to additional user-relevance criteria and focuses the context-building process on the top-ranked documents, thereby improving the relevance of the subsequently generated response.