Understanding Large-Model RAG in One Article: A Detailed Technical Explanation of Retrieval, Augmentation, and Generation

Written by Clara Bennett
Updated on: July 11, 2025
Recommendation

Learn more about RAG technology and explore how it can enhance the application value of large models in professional fields.

Core content:
1. The "illusion" problem faced by large models and its causes
2. The principle and advantages of RAG technology in solving the "illusion" problem
3. The application scenarios and actual effects of RAG technology

The wave of large language models (LLMs) has swept almost every industry, but in professional or vertical scenarios, general-purpose large models often lack sufficient domain knowledge. Compared with expensive post-training or supervised fine-tuning (SFT), a RAG-based solution has become the better choice in many cases.

In this article, we start from the problems RAG solves and a simulated scenario, then walk through the relevant technical details in depth.

First look: the problems RAG solves, and a simulated scenario

The “hallucination” problem of large models

Before discussing the necessity of RAG technology, we first need to understand the famous "hallucination" problem in large models.

The so-called "hallucination" means that when a large model attempts to generate content or answer questions, the output is not completely correct, and may even be wrong, which is commonly referred to as "serious nonsense." Therefore, "this "hallucination" can be reflected in the misrepresentation and fabrication of facts, erroneous complex reasoning, or insufficient processing ability in complex contexts."

The main causes of these "hallucinations" are:

  1. "Training knowledge is biased" : The massive amount of knowledge input when training a large model may contain erroneous, outdated, or even biased information. After being learned by the large model, this information may be reproduced in future outputs.
  2. "Over-generalized reasoning" : Large models try to learn the general rules and patterns of human language through a large amount of corpus, which may lead to the phenomenon of "over-generalization", that is, applying ordinary pattern reasoning to certain specific scenarios will produce inaccurate output.
  3. "Limited understanding" : Large models do not truly "understand" the deep meaning of training knowledge, nor do they have common human common sense and experience. Therefore, they may make mistakes in some tasks that require deep understanding and complex reasoning.
  4. "Lack of knowledge in specific fields" : Although the general large model has mastered a large amount of common human knowledge and has super memory and reasoning ability, it may not be an expert in a vertical field (such as medical or legal experts). When faced with some highly complex domain problems or problems related to private knowledge (such as introducing a new product of the company), it may fabricate information and output it.

In addition to the "hallucination" problem, large models may also have problems such as outdated knowledge, difficult to interpret outputs, and uncertain outputs.

This defines the challenge large models face in large-scale commercial applications: in many cases we need them to offer not only understanding and creativity, but also extremely high accuracy. In fields such as financial risk assessment, medical diagnosis, and legal consultation, any wrong output can lead to serious consequences. Solving the "hallucination" problem is therefore key to improving the practical value of large models.

How does RAG solve the “hallucination” problem?

RAG (Retrieval-Augmented Generation) technology was created to solve problems that large models face in practical applications, especially the "hallucination" problem. Its basic idea can be stated simply as follows:

Combine a traditional generative large model with real-time information retrieval: supply the model with relevant external data and context so that it can generate richer, more accurate, and more reliable content. The model can then draw on up-to-date, personalized data and knowledge when generating, rather than relying only on what it learned during training.

In other words, RAG gives the large model a quick-lookup knowledge "plug-in", letting it consult the latest and most authoritative information sources when facing a specific question, thereby reducing erroneous outputs and "hallucinations".


To further help us understand the concept of RAG, let's take an example.

Imagine the large model as an excellent student who has been trained on a great deal of medical knowledge and treatment techniques, and the model's response process as a medical exam. During the exam, the student may still be unfamiliar with some of the latest treatments and may make up answers from memory and reasoning (the "hallucination"), leading to errors. RAG, by contrast, retrieves relevant information from the latest medical literature and hands it to the student as reference material. The student can then answer based on current professional knowledge, avoiding "hallucinations" and improving the accuracy and reliability of the answers.

Simulating a simple RAG scenario

Suppose you need to build an online self-service product consultation tool that lets customers ask interactive product questions in natural language, such as "Please describe the differences between your company's product and ×× product." To give customers a better experience, you decide to build this consultation feature on a large model and embed it into the company's official website. If you use a general-purpose large model directly, the result will likely look like this.

「User question:」

Please introduce Wheat Company’s latest mobile phone product, Wheat Pro. What is the difference between it and iPhone?

「Model answer:」(the general model cannot accurately describe the Wheat Pro and may make something up)

Not surprisingly, the large model has no knowledge of your company's latest phone, so it cannot answer the customer's question (and may even try to fabricate an answer). Now, applying the idea of RAG, you can first retrieve the following relevant supplementary knowledge from the company's private knowledge base:

Wheat Pro phone parameters overview:

  • System: latest version of Android
  • Processor: high-performance octa-core processor
  • Display: 6.7-inch AMOLED full screen, 2K resolution
  • Rear camera: 64 MP main camera, quad-camera system
  • Front camera: 32 MP, AI beautification, HDR
  • Memory and storage: up to 12 GB RAM and up to 512 GB storage
  • Battery: 5000 mAh large-capacity battery, supports fast charging
  • Connectivity: 5G, Wi-Fi 6, Bluetooth 5.2
  • Other features: IP68 water and dust resistance, facial recognition and fingerprint unlocking. With its powerful performance, sharp display, excellent camera capabilities, and long battery life, the Wheat Pro provides users with a high-quality mobile phone experience.

You then assemble the retrieved supplementary knowledge into the prompt, feed it to the large model, and ask it to answer the question based on the provided knowledge. The model "absorbs" the supplementary external knowledge, combines it with what it already knows, and successfully produces the answer:

「RAG-enhanced answer:」

The main differences between the Wheat Pro phone and the iPhone 12 can be compared across the following aspects:

Operating system:

  • The Wheat Pro runs the latest version of Android, while the iPhone 12 runs Apple's own iOS. This means the two may differ in user interface, application ecosystem, features, and customization options.

Processor performance:

  • The Wheat Pro is equipped with a high-performance octa-core processor, while the iPhone 12 uses Apple's own A14 Bionic chip. A precise comparison would require actual testing, but both should deliver a smooth user experience.

Display:

  • The Wheat Pro uses a 6.7-inch AMOLED full screen with 2K resolution, while the iPhone 12 has a 6.1-inch screen.

Other features:

  • The Wheat Pro has a larger battery capacity (5000 mAh), supports faster charging, and has a higher water and dust resistance rating (IP68).
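
In code, the "retrieve, then assemble the prompt" step of this scenario can be sketched in a few lines of Python. The snippet below is a minimal illustration: it only builds the augmented prompt, and the final call to a large model is left as a comment because the concrete API depends on the model you use (`call_llm` is a hypothetical placeholder, not a real library function).

```python
# Minimal sketch of prompt assembly for the Wheat Pro scenario.
# Only the prompt is built here; the model call at the end is a placeholder.

retrieved_knowledge = """Wheat Pro phone parameters overview:
- System: latest version of Android
- Processor: high-performance octa-core processor
- Display: 6.7-inch AMOLED full screen, 2K resolution
- Battery: 5000 mAh, supports fast charging
- Other: IP68 water and dust resistance"""

user_question = (
    "Please introduce Wheat Company's latest phone, the Wheat Pro. "
    "What is the difference between it and the iPhone?"
)

augmented_prompt = (
    "You are a product consultant. Answer the user's question strictly based on "
    "the reference material below. If the material is insufficient, say so.\n\n"
    f"Reference material:\n{retrieved_knowledge}\n\n"
    f"User question: {user_question}"
)

print(augmented_prompt)
# answer = call_llm(augmented_prompt)  # hypothetical: send the prompt to your LLM of choice
```

Constraining the model to the reference material (and telling it to admit when the material is insufficient) is what suppresses the fabricated answers produced by the bare model.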

RAG Basic Concepts

What is RAG?

"RAG (Retrieval-Augmented Generation)" is a natural language processing (NLP) model that combines retrieval and generation techniques. The model was proposed by Facebook AI and aims to improve the performance of generative models in processing open-domain question-answering, dialogue generation and other tasks.

The RAG model introduces an external knowledge base: a retrieval module (Retriever) extracts relevant information from a large document collection and passes it to a generation module (Generator), which produces more accurate and useful answers or text.

The core idea is to compensate for the shortcomings of generative models (such as GPT-3) on knowledge-intensive tasks through an organic combination of retrieval and generation. In a traditional LLM application, the model relies only on knowledge learned during training, which makes that knowledge hard to update and can lead to outdated or inaccurate answers. A RAG system actively retrieves relevant information before generating an answer and supplies real-time, accurate knowledge to the model as context, significantly improving answer quality and reliability.

RAG is, at its core, prompt engineering with the help of "plug-ins", but it is by no means limited to that. It is not a simple matter of splicing external knowledge into the prompt; a series of optimization techniques ensure that the large model can genuinely understand and make use of this external knowledge, thereby improving the quality of the output.

RAG Architecture

The technical architecture of the RAG model can be divided into two main modules: the retrieval module (Retriever) and the generation module (Generator).


「Retrieval Module」

Responsible for efficient vectorized retrieval over a large-scale knowledge base or document collection, using a pre-trained dual-encoder (bi-encoder) model to quickly find the documents or passages most relevant to the query.
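
A bi-encoder retriever can be sketched with the sentence-transformers library: documents and the query are embedded independently, and cosine similarity ranks the documents. The model name and example documents below are illustrative assumptions, not something prescribed by RAG itself.

```python
# Sketch of a bi-encoder retrieval module: embed documents and query separately,
# then rank documents by cosine similarity. Requires: pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # an illustrative pre-trained bi-encoder

documents = [
    "The Wheat Pro runs Android and has a 5000 mAh battery.",
    "The warranty policy covers repairs within 12 months of purchase.",
    "The Wheat Pro has a 6.7-inch AMOLED display with 2K resolution.",
]

doc_vecs = model.encode(documents)                  # shape: (num_docs, dim)
query_vec = model.encode(["What screen does the Wheat Pro use?"])[0]

# Cosine similarity between the query and every document.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)

top_k = np.argsort(scores)[::-1][:2]                # indices of the 2 most relevant documents
for i in top_k:
    print(f"{scores[i]:.3f}  {documents[i]}")
```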

「Generation Module」

Generates the final answer or text from the retrieved documents and the input query, using a powerful generative model (such as T5 or BART) to process the input and ensure the output is coherent, accurate, and informative.
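
The generation module can be sketched with a small sequence-to-sequence model from Hugging Face transformers: the retrieved passages are packed in front of the question as context. Here `google/flan-t5-base` is just an illustrative stand-in for whatever generator (T5, BART, GPT, ...) you actually deploy.

```python
# Sketch of the generation module: retrieved passages become context in the prompt.
# Requires: pip install transformers
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

retrieved_passages = [
    "The Wheat Pro runs Android and has a 5000 mAh battery with fast charging.",
    "The Wheat Pro has a 6.7-inch AMOLED display with 2K resolution.",
]
question = "What kind of display does the Wheat Pro have?"

prompt = (
    "Answer the question using only the context.\n"
    "Context: " + " ".join(retrieved_passages) + "\n"
    "Question: " + question
)

result = generator(prompt, max_new_tokens=64)
print(result[0]["generated_text"])
```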

RAG Workflow

RAG combines retrieval augmentation with generation: the user query is fused with information from an external knowledge base, and a large language model produces an accurate, reliable answer. The complete RAG workflow is as follows:

  1. Knowledge Preparation

    • Collect knowledge documents: gather relevant documents from sources such as internal enterprise documents, public datasets, and professional databases.
    • Preprocess: clean, deduplicate, and segment the documents to ensure data quality.
    • Chunk and index: split the processed documents into units suitable for retrieval (such as paragraphs or sentences) and build an index over them for fast search.

  2. Embedding and Indexing

    • Use an embedding model: convert the text chunks into high-dimensional vector representations with a pre-trained embedding model (such as BERT or Sentence-BERT).
    • Store the vectors: write the generated vectors into a vector database (such as FAISS, Elasticsearch, or Pinecone) to build an efficient index structure (see the end-to-end sketch after this list).

  3. Query Retrieval

    • Vectorize the user query: convert the user's natural-language query into a vector using the same embedding model.
    • Compute similarity: compare the query vector with the stored vectors in the vector database (usually via cosine similarity or Euclidean distance).
    • Rank the results: based on the similarity scores, select the most relevant documents or passages as the retrieval results.

  4. Prompt Augmentation

    • Assemble the prompt: combine the retrieved document content with the original user query into a new input sequence.
    • Optimize the prompt template: design the prompt template according to the task so the generation module can make full use of the retrieved information. For example:

      User query: Please explain the difference between the Wheat Pro and the iPhone.
      Retrieval result: The Wheat Pro uses the Android system and is equipped with a high-performance octa-core processor, a 6.7-inch AMOLED screen, and a 5000 mAh battery.
      Augmented prompt: Answer the question based on the following information: "The Wheat Pro uses the Android system and is equipped with a high-performance octa-core processor, a 6.7-inch AMOLED screen, and a 5000 mAh battery."

  5. Answer Generation

    • Input the augmented prompt: feed the augmented prompt into the generation module (such as T5, BART, or GPT).
    • Generate text: the generation module produces the final answer based on the prompt, combining the retrieved knowledge with its own training knowledge.
    • Post-process: format and grammar-check the generated answer to ensure the quality and consistency of the output.
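
Putting the steps above together, here is a compact end-to-end sketch: chunks are embedded with sentence-transformers, indexed in FAISS, retrieved for a query, and assembled into an augmented prompt. The final model call is left as a comment because it depends on your chosen LLM; the model name and documents are illustrative assumptions.

```python
# End-to-end RAG sketch: embed -> index (FAISS) -> retrieve -> augment prompt.
# Requires: pip install sentence-transformers faiss-cpu
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Knowledge preparation: pre-chunked passages (illustrative content).
chunks = [
    "The Wheat Pro runs the latest version of Android.",
    "The Wheat Pro has a 6.7-inch AMOLED full screen with 2K resolution.",
    "The Wheat Pro has a 5000 mAh battery and supports fast charging.",
    "The Wheat Pro is rated IP68 for water and dust resistance.",
]

# 2. Embedding and indexing: normalized vectors + inner product = cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(chunks).astype("float32")
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(int(vectors.shape[1]))
index.add(vectors)

# 3. Query retrieval: embed the query the same way and take the top-k chunks.
query = "How big is the Wheat Pro's battery?"
query_vec = embedder.encode([query]).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 2)
retrieved = [chunks[i] for i in ids[0]]

# 4. Prompt augmentation: fold the retrieved chunks into the prompt template.
prompt = (
    "Answer the question based on the following information:\n"
    + "\n".join(retrieved)
    + f"\n\nQuestion: {query}"
)
print(prompt)

# 5. Answer generation: send `prompt` to your generation module of choice
#    (T5, BART, GPT, ...); the concrete API call is omitted here.
```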