Understanding Retrieval-Augmented Generation (RAG) and Multi-Retriever Systems

Written by Iris Vance
Updated on: July 8, 2025

Master retrieval-augmented generation to improve the factual accuracy and domain adaptability of text generation.

Core content:
1. An overview of RAG technology and its role in reducing hallucinations
2. An analysis of the RAG process and its key steps, including query encoding and information retrieval
3. Practical cases and optimization methods for RAG in financial question answering


Retrieval-augmented generation (RAG) is a powerful technique that enhances large language models (LLMs) by integrating external knowledge retrieval into the text generation process. RAG can reduce hallucinations, improve factual accuracy, and support domain-specific optimization. This article explores the RAG process and its mathematical foundations, the retrieval mechanisms it uses (DPR and BM25), FAISS-based optimization, the trade-offs involved, and its application to financial question answering (QA).

A Simple Understanding of RAG

Large language models (LLMs) such as GPT-4 can generate text based on patterns learned from massive amounts of data. However, they have a major limitation: they cannot access new or external knowledge in real time. As a result, they sometimes produce "hallucinations", that is, they generate information that sounds correct but is actually wrong.

Retrieval-augmented generation (RAG) addresses this problem by combining text generation with real-time information retrieval. RAG models not only rely on pre-trained knowledge, but also retrieve relevant documents from external sources (such as Wikipedia, research papers, financial reports, or databases) before generating answers. This makes their answers more accurate and up-to-date.

RAG Process: A Detailed Explanation

Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by retrieving relevant information before generating an answer. This makes the output more accurate, factual, and contextually relevant. To understand how it works, the RAG process can be broken down into four key steps:

Step 1: Query Encoding

Query encoding converts the user's question into a searchable format. When a user asks a question or enters a query, the system does not simply treat it as normal text, but instead converts the query into a numerical format that can be efficiently compared with stored documents.

This conversion is done by neural encoders such as BERT (Bidirectional Encoder Representations from Transformers) or SecBERT (a version of BERT specifically optimized for financial or security data).

  • The encoded query is represented as a vector, which is essentially a list of values that captures the semantic meaning of the input.
  • This approach is more effective than direct keyword matching (searching only for specific words), because people phrase questions in different ways and the same word can have multiple meanings. Vector encoding retrieves information based on meaning, not just word matching.

For example, if a user asks “How does inflation affect stock prices?”, the system converts the query into a dense vector that captures its core meaning. This way, even if the relevant documents use different wording, such as “the relationship between inflation and the stock market”, the system can still find matching content.
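
To make this concrete, here is a minimal sketch of query encoding, assuming the sentence-transformers package and the small general-purpose model all-MiniLM-L6-v2 (illustrative choices, not the exact encoders named above):

```python
# A minimal sketch of query encoding; the package and model below are
# illustrative assumptions, not the encoders used by any specific system.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "How does inflation affect stock prices?"
doc = "The relationship between inflation and the stock market"

# Both texts become dense vectors that capture meaning, not exact wording,
# so differently worded but related texts end up close in vector space.
q_vec, d_vec = encoder.encode([query, doc])
print(q_vec.shape)  # a fixed-size embedding, e.g. (384,) for this model
```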

Step 2: Information Retrieval

The goal of this phase is to find the most relevant information. Once a query is encoded, the system searches the document database for the best matches, ensuring that the model has access to authentic, up-to-date, factual material, rather than relying solely on knowledge learned during training.

There are two main search methods:

1. Dense Passage Retrieval (DPR)

  • DPR uses neural networks to find the most relevant documents based on semantics, not just matching words.
  • Both the query and the documents are converted into vector embeddings, and the system retrieves the documents most similar to the query vector.
  • Suitable for: cases where documents are worded differently but share a similar core meaning; DPR can match them effectively.

2. BM25-Based Sparse Retrieval

  • BM25 is a mathematical formula that ranks documents based on keyword relevance.
  • It takes into account how frequently a keyword occurs and where it is located (e.g., in the title or deep in the body text).
  • Unlike DPR, BM25 does not use AI-based vector search; it relies on direct word matching.
  • Best for: situations where exact keyword matches are required, such as legal documents or financial reports.

Example: If a user asks “What are the risks of investing in cryptocurrency?”, a search engine might find the following among millions of documents:

  1. A recent financial news article discussing the volatility of the cryptocurrency market.
  2. A government report warning of the regulatory risks of cryptocurrencies.
  3. A blog post by an investment expert analyzing common investment risks.
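
As a hedged illustration of the sparse path, the sketch below ranks a toy corpus with the rank_bm25 package (an assumed dependency; any BM25 implementation would work the same way):

```python
# A minimal sketch of BM25-based sparse retrieval, assuming the rank_bm25
# package (pip install rank-bm25); the corpus is a toy stand-in.
from rank_bm25 import BM25Okapi

corpus = [
    "Cryptocurrency markets are highly volatile and prices swing sharply",
    "A government report warns of regulatory risks around cryptocurrencies",
    "Diversification reduces common investment risks in a portfolio",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "risks of investing in cryptocurrency".lower().split()

# Higher scores mean stronger keyword relevance to the query.
for score, doc in sorted(zip(bm25.get_scores(query), corpus), reverse=True):
    print(f"{score:.2f}  {doc}")
```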

These retrieved documents are then fused in the next step.

Step 3: Information Fusion

This phase fuses the retrieved information with the user query. Since the system may retrieve multiple relevant documents, it needs to decide how to use this information effectively. Simply feeding all the text into the language model is inefficient and may even confuse it.

Common fusion methods include:

  1. Concatenation: the retrieved documents are appended directly to the input query and then fed into the language model.
  2. Re-ranking: the system scores the retrieved documents, giving priority to the most relevant ones.
  3. Weighted attention mechanisms: some RAG models highlight important information so it has more influence when generating answers.

Example: If a user asks "How does the Federal Reserve's interest rate policy affect inflation?", the system might retrieve the following four relevant documents:

  1. The Federal Reserve's recent interest rate hike report.
  2. An economist's blog explaining inflation trends.
  3. A news article summarizing the impact of interest rates on consumer spending.
  4. A research paper analyzing historical inflation cycles.

The goal of the fusion phase is to determine which documents are most relevant and present them to the model in the best way, ensuring that the generated answer is grounded in facts. A minimal sketch of two fusion strategies follows.
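
```python
# A minimal sketch of re-ranking followed by concatenation. The scoring
# function is a hypothetical stand-in for a real cross-encoder or
# similarity model.
def relevance_score(query: str, doc: str) -> float:
    # Hypothetical scorer: fraction of query words that appear in the doc.
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)

def fuse(query: str, docs: list[str], top_k: int = 2) -> str:
    # Re-ranking: keep only the top_k most relevant documents.
    ranked = sorted(docs, key=lambda d: relevance_score(query, d), reverse=True)
    # Concatenation: prepend the selected context to the query.
    context = "\n\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "The Federal Reserve raised interest rates by 25 basis points.",
    "Historical inflation cycles show long lags after policy changes.",
    "Consumer spending fell as borrowing costs increased.",
]
print(fuse("How does interest rate policy affect inflation?", docs))
```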

Step 4: Answer Generation

Finally, the retrieved and fused information is used to generate the final answer. Unlike traditional chatbots, a RAG model does not rely only on pre-trained knowledge; it can also refer to external documents in real time.

  • Capable language models (such as GPT-4, T5, or BART) are responsible for generating the final answer.
  • The model combines information from multiple sources to ensure answers are more accurate and informative.
  • By anchoring generation to retrieved data, RAG reduces "hallucinations" (i.e., made-up facts).

Example: A user asks "What are the latest trends in the current stock market?"

  1. The system retrieves the latest financial reports and news articles to ensure the information source is reliable.
  2. The model then generates a clear, structured response:

    "As of March 2025, the S&P 500 Index has shown high volatility due to rising interest rates. Analysts expect further market volatility, especially in the technology and energy sectors. Recent reports from Bloomberg and CNBC show that AI-related stocks have seen strong earnings performance."

Without RAG, a traditional model might provide outdated information, but RAG keeps answers timely and accurate through real-time retrieval. A minimal sketch of this generation step follows.
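
```python
# A minimal sketch of grounding generation in retrieved context. `call_llm`
# is a hypothetical placeholder; substitute whatever LLM client you use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")  # hypothetical

def answer_with_rag(query: str, retrieved_docs: list[str]) -> str:
    context = "\n\n".join(retrieved_docs)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```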

Mathematical formula for RAG

The four main steps of the RAG process described above can also be studied mathematically. Next, we step through the mathematical foundations of RAG and explain each formula in plain language. If you are not interested in the mathematical details, you can skip this section and go directly to the retrieval mechanisms and applications.

Step 1: Query Encoding — Converting the question into a searchable format

When a user provides a query, the system needs to convert it into a machine-readable format. Instead of processing the text directly as ordinary words, the system encodes it into a dense vector representation, a structured numerical format that captures the semantic information of the query.

Mathematically, the process can be expressed as:

$$q' = \mathrm{Encoder}(q; \theta)$$

where:

  • $q'$ is the vector representation of the query (i.e., the encoded query).
  • $\mathrm{Encoder}(\cdot)$ is a neural encoder (such as BERT, DPR, or SecBERT) that converts text into numerical embeddings.
  • $\theta$ denotes the learned parameters of the encoder, i.e., the parameters the model optimizes during training.

Purpose: the encoded query $q'$ acts as a "search key" to find the most relevant documents in the database.

Step 2: Retrieval Probability — Find the most relevant documents

Once the query is converted into a vector, the system searches for matching documents in a large knowledge base. The goal is to find the documents most similar to the query.

How is similarity measured? Usually with **cosine similarity** or **dot-product similarity**. The probability of retrieving a document $d$ can then be expressed as a softmax over similarity scores:

$$P(d \mid q') = \frac{\exp\big(\mathrm{sim}(q', d)\big)}{\sum_{d' \in \mathcal{D}} \exp\big(\mathrm{sim}(q', d')\big)}$$

where:

  • $\mathrm{sim}(q', d)$ is the similarity score between the query $q'$ and document $d$.
  • $\exp(\cdot)$ ensures that all similarity scores are positive and scaled appropriately.
  • The denominator normalizes over all candidate documents $\mathcal{D}$ so that the final probability lies between 0 and 1.

Effect: this formula ensures that the most relevant documents receive the highest retrieval probability, improving the retrieval accuracy of the system.
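
A short worked example of this softmax, using NumPy with made-up similarity scores:

```python
# A worked example of the retrieval probability above (made-up scores).
import numpy as np

sim_scores = np.array([4.2, 3.1, 0.5])  # sim(q', d) for three candidate docs

# Softmax: exponentiate, then normalize so the probabilities sum to 1.
probs = np.exp(sim_scores) / np.sum(np.exp(sim_scores))
print(probs)        # ≈ [0.74, 0.24, 0.02] — the most similar doc dominates
print(probs.sum())  # 1.0
```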

Step 3: Response Generation — Generate a coherent response

Once the system has retrieved the most relevant documents, the LLM (Large Language Model) needs to generate an answer based on the query and the retrieved information.

Mathematically, this process can be expressed as:

$$P(y \mid q, D) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, q, D)$$

where:

  • $y$ is the final generated answer sequence (i.e., the text output by the model).
  • $y_t$ is the $t$-th token in the answer.
  • $y_{<t}$ denotes the previously generated tokens, which keep the generated sentence coherent.
  • $q$ is the original user query.
  • $D$ is the set of retrieved documents.

Function: this generation process proceeds step by step, ensuring that the output answer is both factually consistent and grammatically and semantically coherent.
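
A short worked example of the factorization, with made-up per-token probabilities:

```python
# The probability of a generated sequence is the product of per-token
# probabilities (made-up numbers for illustration).
import numpy as np

token_probs = np.array([0.9, 0.8, 0.95, 0.7])  # P(y_t | y_<t, q, D)

seq_prob = np.prod(token_probs)             # product form
log_seq_prob = np.sum(np.log(token_probs))  # equivalent, more stable log form
print(seq_prob, np.exp(log_seq_prob))       # both ≈ 0.4788
```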

Step 4: End-to-End Optimization — Let the model continue to improve

To ensure that the system generates the best possible answers, the model is continuously optimized, trained using Maximum Likelihood Estimation (MLE).

Its objective function (loss function) can be expressed as:

$$\mathcal{L} = -\sum_{(q, D, y) \in \mathcal{T}} \log P(y \mid q, D)$$

where:

  • $P(y \mid q, D)$ is the probability of generating the correct answer, i.e., how likely the model is to produce the right output.
  • The $\log$ function simplifies computation and makes the learning process more stable.
  • The training dataset $\mathcal{T}$ contains (query, document, correct answer) triplets, ensuring that the model learns correct answer patterns.

Effect: by minimizing this objective, the model continuously learns and adjusts its weights to generate more accurate answers in the future.
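
A minimal sketch of this negative log-likelihood objective over a toy batch:

```python
# MLE as negative log-likelihood, with made-up per-triplet probabilities.
import numpy as np

# P(y | q, D) for three (query, documents, answer) training triplets
answer_probs = np.array([0.82, 0.61, 0.93])

loss = -np.sum(np.log(answer_probs))
print(loss)  # lower loss = higher probability assigned to correct answers
```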

Retrieval Mechanisms in RAG: Dense Passage Retrieval (DPR) and Sparse Retrieval (BM25)

Dense Passage Retrieval (DPR)

DPR is a neural-network-based retrieval method that uses deep learning models to understand the semantics behind words. Unlike traditional methods based only on keyword matching (such as BM25), DPR converts queries and documents into numerical representations (embedding vectors) and then computes similarities to retrieve the most relevant documents.

How DPR works

DPR uses a two-step search process :

1. Encoding the Query and Documents

DPR uses a bi-encoder architecture, with two separate neural networks:

  • Query encoder: processes the user query.
  • Document encoder: processes the documents.

Both transform their input into a high-dimensional vector (a dense embedding).

2. Retrieval Using Similarity Matching

  • Once both the query and the document are converted into vectors, the system calculates the similarity score between them.
  • This similarity is usually computed as the **cosine similarity**:

$$\mathrm{sim}(q, d) = \frac{E_Q(q) \cdot E_D(d)}{\lVert E_Q(q) \rVert \, \lVert E_D(d) \rVert}$$

where $E_Q$ and $E_D$ are the query and document encoders described above.

  • The documents with the highest similarity are retrieved.
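
A small worked example of cosine similarity, with made-up 3-dimensional embeddings:

```python
# Cosine similarity between a query vector and two document vectors
# (made-up 3-dimensional embeddings for illustration).
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q  = np.array([0.9, 0.1, 0.3])
d1 = np.array([0.8, 0.2, 0.4])  # points in a similar direction to q
d2 = np.array([0.1, 0.9, 0.1])  # unrelated direction

print(cosine_sim(q, d1))  # ≈ 0.98 — retrieved
print(cosine_sim(q, d2))  # ≈ 0.24 — skipped
```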

BM25-Based Sparse Retrieval

BM25 is a statistics-based ranking algorithm that retrieves documents based on keyword frequency. BM25 is a "bag-of-words" model, meaning it does not consider the semantics of words, only how often they occur in a document.

How BM25 works

BM25 uses the following factors to rank documents:

  • Term Frequency (TF): the number of times a keyword appears in a document.
  • Inverse Document Frequency (IDF): how rare the keyword is across the entire dataset.
  • Document Length Normalization: adjusts the score based on the document's length.

The BM25 score is calculated as follows:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where:

  • $t$: a keyword in the query.
  • $f(t, d)$: the frequency of term $t$ in document $d$.
  • $k_1$: controls the effect of term frequency (usually set to 1.2 or 2.0).
  • $b$: controls document length normalization (usually set to 0.75).
  • $|d|$: the document length.
  • $\mathrm{avgdl}$: the average length of documents in the dataset.
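
A minimal sketch of this formula implemented directly in Python (toy corpus; the IDF variant below is one common choice among several):

```python
# BM25 scoring implemented from the formula above. Toy corpus; the IDF
# variant used here (log with +1) is one common choice, not the only one.
import math

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query:
        n_t = sum(1 for d in corpus if t in d)           # docs containing t
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f = doc.count(t)                                  # term frequency
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "inflation erodes purchasing power over time".split(),
    "stock prices react to interest rate changes".split(),
]
query = "inflation and stock prices".split()
for doc in corpus:
    print(round(bm25_score(query, doc, corpus), 3), " ".join(doc))
```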

FAISS: Accelerating DPR Retrieval Using Vector Search

Although DPR is powerful, searching for similar vectors in millions of documents is computationally expensive. FAISS (Facebook AI Similarity Search) is an efficient vector search library that can significantly improve retrieval speed.

How FAISS works

FAISS employs three key optimization strategies (a sketch of an IVF index follows this list):

  1. IVF (Inverted File Indexing)
    • FAISS first clusters similar documents.
    • At query time, it finds the closest clusters and searches only within them instead of traversing all documents, which greatly speeds up retrieval.
  2. HNSW (Hierarchical Navigable Small World Graphs)
    • Uses graph-based retrieval to find similar documents in near-constant time.
    • Reduces computation by hopping efficiently between nodes, avoiding a scan of the entire dataset.
  3. PQ (Product Quantization)
    • Reduces memory consumption while maintaining high retrieval accuracy.
    • Instead of storing complete document vectors, FAISS compresses them into smaller codes.
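
```python
# A minimal sketch of an IVF index with the faiss library; random vectors
# stand in for real document embeddings.
import faiss
import numpy as np

d, n_docs, nlist = 128, 10_000, 64           # dim, corpus size, clusters
doc_vecs = np.random.rand(n_docs, d).astype("float32")

quantizer = faiss.IndexFlatIP(d)             # inner-product base index
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(doc_vecs)                        # learn the cluster centroids
index.add(doc_vecs)

index.nprobe = 8                             # clusters to visit per query
query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 5)         # top-5 nearest documents
print(ids, scores)
```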

Comparing all three:

  • DPR is most useful when semantic understanding matters: it can retrieve conceptually similar documents even when the exact words differ.
  • BM25 is better suited to fast, explainable keyword search where exact term matching is important.
  • FAISS is critical for improving the scalability and efficiency of DPR.

Comparison of DPR, BM25, and FAISS

| Method | Applicable scenarios | Advantages | Disadvantages |
| --- | --- | --- | --- |
| DPR (Dense Passage Retrieval) | Tasks that require semantic understanding | Retrieves semantically similar documents even when the vocabulary differs | High computational cost; slow search |
| BM25 (sparse retrieval) | Keyword-matching tasks | Fast to compute and easy to interpret | Cannot understand semantics; limited to word matching |
| FAISS (accelerated DPR) | Tasks that require large-scale retrieval | Improves DPR scalability and reduces computational overhead | Still depends on the vectors produced by DPR |

RAG's Impact and Future Development

RAG is revolutionizing AI-driven search and text generation by combining retrieval-based reasoning with advanced language modeling. It is particularly valuable in scenarios that require real-time, fact-based, domain-specific knowledge retrieval, such as:

  • Financial research: analyzing market data and answering financial questions.
  • Legal analysis: analyzing laws and regulations and providing compliance advice.
  • Medical diagnostics: generating disease analyses and diagnoses based on medical literature.
  • Academic research: helping scholars search for papers and summarize research results.

In addition, the Multi-Retriever approach further enhances RAG's capabilities in financial QA. For example, it can integrate structured regulatory data (such as IRS tax laws and SEC filings) with real-world news and expert opinions to ensure accurate and up-to-date answers. A hedged sketch of combining two retrievers' scores follows.
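
```python
# A minimal sketch of multi-retriever score fusion: min-max-normalize the
# scores from a sparse retriever (e.g., BM25 over filings) and a dense
# retriever (e.g., an encoder over news), then blend them with a tunable
# weight. All scores below are made-up numbers; the weighting scheme is a
# common technique, not one prescribed by this article.
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

bm25_scores  = np.array([12.4, 3.1, 8.7])    # sparse retriever scores
dense_scores = np.array([0.82, 0.75, 0.31])  # dense retriever scores

alpha = 0.5  # weight between sparse and dense evidence
hybrid = alpha * minmax(bm25_scores) + (1 - alpha) * minmax(dense_scores)
print(hybrid.argsort()[::-1])  # document ranking after fusion
```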

As AI continues to develop, RAG models will become a key component of trustworthy and accurate AI applications. Whether answering complex financial questions, summarizing legal texts, or generating medical reports, RAG represents a major breakthrough in knowledge-driven AI, making models not only fluent but also more reliable and better grounded in knowledge.