RAG Architecture Overview: Finding the Most Suitable RAG Solution

Written by Silas Grey
Updated on: June 20th, 2025

Master RAG technology to improve the accuracy and reliability of information retrieval and generation.

Core content:
1. Overview of RAG technology and its architecture types
2. Working mechanism and application scenarios of standard RAG architecture
3. Optimization strategies and applications of corrective RAG, speculative RAG and fusion RAG

Yang Fangxian, Founder of 53A, Tencent Cloud Most Valuable Expert (TVP)

RAG technology combines retrieval over external knowledge sources with a model's generation capabilities, enabling language models to ground their answers in real-world information and produce more accurate, reliable responses. The technology continues to evolve and has spawned a range of distinctive architecture types, each optimized for specific scenarios and needs. A solid understanding of these architectures is valuable for developers, data scientists, and AI practitioners: it informs better technology choices in their projects and lets them exploit the strengths of RAG to the fullest.

1. Infrastructure: Standard RAG

Standard RAG is the cornerstone of the RAG technology family. Its classic design pairs a retriever with a generator. The retriever filters the documents relevant to the user's question out of a large knowledge base; these documents are first split into small, manageable chunks to keep retrieval efficient and targeted. The generator (a capable language model such as GPT-4) then analyzes the retrieved passages and produces an accurate, useful answer.
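
The retrieve-then-generate loop can be sketched in a few lines. This is a toy illustration, not a production design: word-overlap scoring stands in for a real embedding retriever, and generate() is a placeholder for a call to a language model such as GPT-4.

```python
def chunk(text, size=8):
    """Split a document into fixed-size word chunks for retrieval."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

def generate(query, context):
    """Placeholder for the generator: an LLM would answer from this context."""
    return f"Based on: {' | '.join(context)}"

docs = [
    "Refunds are processed within five business days after the request is approved.",
    "Standard shipping is free for all orders above fifty dollars.",
]
chunks = [c for d in docs for c in chunk(d)]
answer = generate("how long do refunds take", retrieve("how long do refunds take", chunks))
print(answer)
```

Swapping the overlap score for dense-vector similarity and the placeholder for a real model call turns this skeleton into the standard architecture described above.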

This architecture has two notable strengths. First, sensible document chunking greatly improves retrieval efficiency, letting the system locate the most relevant information quickly and giving the generation step a strong foundation. Second, it suits scenarios with tight response-time requirements: answers typically arrive within one to two seconds, fast enough for real-time interaction.

In practice, standard RAG has a wide range of uses. A customer-support chatbot can pull accurate answers from FAQ documents and resolve user questions promptly. In the legal field, a document question-answering system can use standard RAG to retrieve key passages from large collections of case law, regulations, and contracts, giving users well-grounded, compliant answers. It is also a natural fit for internal enterprise knowledge management: an internal knowledge assistant built on standard RAG helps employees find the information they need quickly and work more efficiently.

2. Optimization strategies: corrective RAG, speculative RAG and fusion RAG

1. Corrective RAG: Accurately Optimizing Answers

Corrective RAG targets the problem of inaccurate answers by building a feedback loop that refines them continuously. On an e-learning platform, for example, when automatically generated test answers are not accurate enough, corrective RAG can verify and revise them against feedback from students or teachers, improving accuracy and user satisfaction. The stakes are even higher in medicine: a medical chatbot's accuracy bears directly on patient safety, and any wrong information can have serious consequences. With corrective RAG, the chatbot re-checks each generated answer for consistency with authoritative medical information, catches and corrects likely errors, and so delivers a reliable consultation service.
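
The feedback loop can be sketched as draft, verify, regenerate. Everything here is a toy stand-in: a trusted fact table and a substring check play the role of a real verifier model, and the hard-coded generator simulates a model whose first draft is wrong.

```python
TRUSTED_FACTS = {"Australia": "Canberra"}

def generate(question, feedback=None):
    """Stand-in generator: produces a wrong draft until given feedback."""
    if feedback:
        return f"The capital of Australia is {feedback}."
    return "The capital of Australia is Sydney."

def verify(answer, facts):
    """Verifier: return the trusted value the answer should contain but omits."""
    for entity, value in facts.items():
        if entity in answer and value not in answer:
            return value
    return None

def corrective_answer(question):
    """Feedback loop: draft an answer, check it against trusted facts, redo."""
    answer = generate(question)
    feedback = verify(answer, TRUSTED_FACTS)
    if feedback:
        answer = generate(question, feedback=feedback)
    return answer

print(corrective_answer("What is the capital of Australia?"))
# -> The capital of Australia is Canberra.
```

In a real system the verifier would itself retrieve authoritative documents (or consume user ratings) and the loop would run until verification passes or a retry budget is exhausted.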

2. Speculative RAG: Balancing Speed and Accuracy

Speculative RAG adopts a "quick draft first, careful verification second" strategy. A small, fast model produces a preliminary answer, much as a painter blocks out a rough sketch to fix the overall direction and framing; a large model then rigorously verifies and refines that draft so the final answer is both fast and accurate. In news, summary bots need both speed and accuracy: speculative RAG lets a bot produce a first-draft summary in moments, which the large model then checks and polishes into a precise final version. In e-commerce, the same pattern powers product-description generators: the small model drafts a description quickly, and the large model validates it against the product's specifications and catalog data, keeping the copy both creative and truthful and thereby strengthening buyers' confidence.
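
The draft-then-verify control flow looks roughly like this. Both "models" here are invented toys: a cheap first-match lookup plays the small drafter, and a stricter keyword check plays the large verifier; only the fast-path/slow-path structure is the point.

```python
CORPUS = [
    "The battery lasts up to 12 hours of continuous playback.",
    "The headphones weigh 250 grams and fold flat for travel.",
]

def small_model_draft(query):
    """Fast drafter: grab the first sentence sharing any word with the query."""
    q = set(query.lower().split())
    for doc in CORPUS:
        if q & set(doc.lower().split()):
            return doc
    return ""

def large_model_verify(query, draft):
    """Slow verifier: accept only if the draft covers the query's key term."""
    key = query.lower().split()[-1].rstrip("?")
    return key in draft.lower()

def speculative_answer(query):
    draft = small_model_draft(query)
    if large_model_verify(query, draft):
        return draft                      # fast path: cheap draft accepted
    # slow path: rank the whole corpus carefully instead
    q = set(query.lower().split())
    return max(CORPUS, key=lambda d: len(q & set(d.lower().split())))

print(speculative_answer("how long does the battery last"))
```

When the cheap draft passes verification, the expensive full pass is skipped entirely, which is where the latency savings come from.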

3. Fusion RAG: Integrating diverse knowledge

Fusion RAG breaks through the limitations of a single source of knowledge. It obtains information from multiple search engines and data sources and organically integrates this information to provide richer and more comprehensive knowledge support for generating answers. In the field of financial analysis, market conditions are complex and changeable, and multiple factors need to be considered comprehensively. Fusion RAG can integrate policy information in regulatory documents, real-time dynamics in market news, and professional opinions of experts to provide investors with comprehensive and in-depth financial analysis reports to help them make more informed investment decisions. When building a cross-platform legal consulting assistant, fusion RAG can collect legal information from multiple platforms such as court rulings, legal databases, and industry news websites, and provide users with accurate and authoritative legal advice after comprehensive analysis to meet users' needs in complex legal scenarios.
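
One common way to merge rankings from several retrievers is reciprocal rank fusion (RRF): each source votes 1/(k + rank) for every document it returns, so documents ranked highly by multiple sources rise to the top. The three ranked lists below are made-up stand-ins for a regulatory source, a news feed, and an expert-opinion source.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists into one, rewarding consistent high ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

regulatory  = ["policy_update", "capital_rules", "esg_guidance"]
market_news = ["rate_decision", "policy_update", "earnings_beat"]
expert_view = ["policy_update", "rate_decision"]

fused = reciprocal_rank_fusion([regulatory, market_news, expert_view])
print(fused[0])   # "policy_update" ranks high in every source, so it wins
```

The constant k damps the influence of any single list; 60 is the value commonly used in the RRF literature, but it is tunable.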

3. Intelligent expansion: agent-based RAG, self-RAG, and adaptive RAG

1. Agent-based RAG: an intelligent assistant that makes autonomous decisions

Agent-based RAG introduces intelligent agents that can dynamically plan, retrieve knowledge, and generate answers in response to the situation at hand. In AI research, agent-based RAG shines on complex, multi-step queries. In a policy research scenario, for instance, an autonomous research assistant's agent can automatically retrieve data from legislative databases, academic papers, and current news; analyze and compare the material; flag contradictions; and rank sources by credibility, before producing a detailed, logically rigorous policy briefing with precise citations. For competitive intelligence, agent-based RAG can help a startup continuously monitor competitors, gather information from website updates, press releases, and social media, and distill it into actionable market briefings that sharpen the company's strategy.
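
A stripped-down sketch of the plan-retrieve-rank loop follows. The "sources" are toy dictionaries standing in for legislative databases, papers, and news feeds, and the credibility weights are illustrative assumptions, not measured values.

```python
SOURCES = {
    "legislation": {"credibility": 0.9, "facts": ["Bill 42 passed in 2024."]},
    "papers":      {"credibility": 0.8, "facts": ["Studies link Bill 42 to lower emissions."]},
    "news":        {"credibility": 0.5, "facts": ["Critics say Bill 42 raises costs."]},
}

def plan(question):
    """Agent step 1: decide which sources the question requires."""
    return ["legislation", "papers", "news"] if "Bill 42" in question else ["news"]

def research(question):
    """Agent steps 2-3: retrieve from each planned source, then order the
    gathered evidence by the credibility of its origin."""
    evidence = []
    for name in plan(question):
        src = SOURCES[name]
        evidence += [(src["credibility"], fact) for fact in src["facts"]]
    evidence.sort(reverse=True)          # most credible evidence first
    return [fact for _, fact in evidence]

briefing = research("What is the impact of Bill 42?")
print(briefing[0])   # the legislative record leads the briefing
```

A real agent would also loop: inspect the gathered evidence, detect contradictions, and issue follow-up retrievals before synthesizing the final briefing.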

2. Self-RAG: Optimizing from Its Own Experience

When retrieving knowledge, self-RAG first looks for relevant information in its own previous outputs and consults an external knowledge base only when its own experience is insufficient. This mechanism matters wherever consistency must be maintained. In long-form fiction, for example, self-RAG keeps the story's style and plot logic consistent across chapters. In academia, an academic critique assistant built on self-RAG can first review its earlier analyses of similar studies, then combine them with freshly retrieved literature to produce a deeper, more comprehensive critique of a paper, raising both the quality and the efficiency of the research.
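
The lookup order described above (own outputs first, external knowledge base as fallback) can be sketched as follows. Overlap scoring is a toy stand-in for a real relevance model, and the threshold of two shared words is an arbitrary assumption.

```python
class SelfRAG:
    """Answer from the model's own past outputs before going external."""

    def __init__(self, external_kb):
        self.past_outputs = []           # everything the model has generated
        self.external_kb = external_kb

    def _best_match(self, query, texts):
        """Return the text sharing enough words with the query, else None."""
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in texts]
        best = max(scored, default=(0, None))
        return best[1] if best[0] >= 2 else None

    def answer(self, query):
        hit = self._best_match(query, self.past_outputs)
        if hit is None:                  # own experience is not enough
            hit = self._best_match(query, self.external_kb) or "No information found."
        self.past_outputs.append(hit)
        return hit

rag = SelfRAG(["The hero's sword was forged in the northern mountains."])
first = rag.answer("where was the hero's sword forged")
again = rag.answer("where was the hero's sword forged")   # served from memory
print(first == again)
```

The second call never touches the external knowledge base, which is exactly how the mechanism preserves consistency across, say, chapters of a long story.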

3. Adaptive RAG: Intelligently determine retrieval requirements

Adaptive RAG can intelligently decide whether knowledge retrieval is needed based on the characteristics of the problem and the model's own judgment. It triggers retrieval operations through internal model signals, achieving a balanced use of internal memory and external knowledge. In the medical field, when a virtual medical assistant handles patient consultations, if it is a common simple question, such as general cold symptom consultation, the assistant can use internal memory to quickly give an answer; for complex symptoms, such as diagnosis consultations for rare diseases, the assistant will actively search external databases to obtain more professional and comprehensive medical information and provide patients with accurate diagnostic advice. In the internal help desk scenario of an enterprise, adaptive RAG can intelligently adjust the retrieval strategy based on the user's role and question type. For example, for complex technical questions raised by technicians, the help desk system will retrieve detailed technical documents and logs; and for simple questions about the onboarding process for new employees, the system will quickly obtain answers from the FAQ library to improve service efficiency.
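
The gating decision can be sketched with a simple confidence signal. Everything here is illustrative: in a real system the signal would come from the model itself (e.g. token-level uncertainty), not from a keyword list, and the memory and database contents are invented.

```python
INTERNAL_MEMORY = {"cold": "Rest, fluids, and over-the-counter remedies usually suffice."}
EXTERNAL_DB = {"fabry disease": "Refer to a specialist; enzyme replacement therapy may apply."}

def confidence(question):
    """Toy internal signal: high when the question hits a known simple topic."""
    return 0.9 if any(topic in question.lower() for topic in INTERNAL_MEMORY) else 0.2

def adaptive_answer(question, threshold=0.5):
    """Answer from internal memory when confident; otherwise trigger retrieval."""
    if confidence(question) >= threshold:
        topic = next(t for t in INTERNAL_MEMORY if t in question.lower())
        return INTERNAL_MEMORY[topic], "internal"
    for topic, info in EXTERNAL_DB.items():          # retrieval triggered
        if topic in question.lower():
            return info, "external"
    return "Please consult a clinician.", "external"

print(adaptive_answer("What should I do about a common cold?"))
print(adaptive_answer("How is Fabry disease treated?"))
```

The payoff is that easy questions skip retrieval entirely, while hard ones still get the benefit of external knowledge.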

4. Advanced Applications: REFEED, REALM and RAPTOR

1. REFEED: Optimization without Retraining

The uniqueness of REFEED technology is that it does not require retraining of the model, but improves the quality of answers by optimizing the retrieval process. It reorders and optimizes answers based on feedback signals after retrieval, such as user click behavior or ratings on documents. In enterprise search engine optimization, REFEED can analyze user search behavior in real time, understand user satisfaction with search results, and then adjust retrieval strategies to make search results more in line with user expectations. In the field of human resources, when building an intelligent interview assistant, REFEED can adjust the retrieval and generation strategies of subsequent questions in a timely manner based on real-time feedback from the interviewer, such as corrections or evaluations of answers to a question, to improve the efficiency and quality of the interview.
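
Post-retrieval reordering from feedback signals can be as simple as blending each document's retrieval score with its click count, with no model retraining involved. The scores, click counts, and blending weight below are all illustrative.

```python
def refeed_rerank(results, clicks, weight=0.1):
    """Blend each document's retrieval score with its click feedback."""
    return sorted(
        results,
        key=lambda doc: results[doc] + weight * clicks.get(doc, 0),
        reverse=True,
    )

retrieval_scores = {"handbook.pdf": 0.82, "faq.html": 0.80, "memo.txt": 0.75}
click_counts = {"faq.html": 5}           # users consistently prefer the FAQ

print(refeed_rerank(retrieval_scores, click_counts))
```

Because only the ranking function changes, the feedback takes effect immediately for the next query, which is the "no retraining" property the section describes.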

2. REALM: Retrieval-Aware Language Modeling

REALM integrates the training of the retriever into the model training stage, uses large-scale corpora (such as Wikipedia-scale corpora) for training, and adopts advanced technologies such as maximum inner product search (MIPS) to enable the model to learn effective retrieval patterns during the training process. This training method enables the model to perform well in open-domain question-answering scenarios, and can more accurately understand questions and retrieve relevant information. In the project of generating biographies, the model trained based on REALM can accurately retrieve information related to the person from a large number of news archives, interview records and articles, and organically integrate this information to generate rich, accurate and detailed biographies. In the medical field, when building a medical question-answering system for professionals, REALM can enable the model to deeply understand the retrieval needs of medical literature. When answering questions, it can not only retrieve relevant research, but also accurately grasp the medical background of the research, and provide more professional and reliable medical answers.
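
Maximum inner product search (MIPS), the retrieval primitive mentioned above, picks the document whose dense vector has the largest dot product with the query vector. The tiny three-dimensional vectors below are toy data; real systems use learned embeddings with hundreds of dimensions and approximate MIPS indexes for speed.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mips(query_vec, doc_vecs):
    """Return the index of the document maximizing the inner product."""
    return max(range(len(doc_vecs)), key=lambda i: dot(query_vec, doc_vecs[i]))

doc_vectors = [
    [0.9, 0.1, 0.0],   # doc 0: mostly about topic A
    [0.1, 0.8, 0.3],   # doc 1: mostly about topic B
    [0.0, 0.2, 0.9],   # doc 2: mostly about topic C
]
query_vector = [0.2, 0.7, 0.1]   # query leans toward topic B

print(mips(query_vector, doc_vectors))   # -> 1
```

REALM's contribution is training the encoder so that these inner products reflect true relevance; the search step itself stays this simple in principle.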

3. RAPTOR: Efficient retrieval based on tree-like reasoning

RAPTOR uses a unique tree structure to organize and retrieve content, clustering knowledge into a hierarchical tree structure, from macro themes to specific details, to achieve multi-level retrieval. This structure has significant advantages in dealing with complex problems and can quickly locate different levels of relevant information. In the legal research scenario, the legal research robot can use RAPTOR's tree retrieval structure to start from broad regulatory categories and gradually go deep into specific case details, efficiently retrieving the required legal provisions and case information. In the field of financial risk assessment, when building complex financial risk assessment agents, RAPTOR can decompose investment risk assessment problems into multiple sub-factors, such as market fluctuations, regulatory changes, company fundamentals, etc., search along the path corresponding to each sub-factor, collect relevant financial data and information, and finally conduct comprehensive analysis to generate a comprehensive and accurate risk assessment report.
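
Hierarchical retrieval over a RAPTOR-style tree can be sketched as a descent from broad topic summaries to specific leaves, at each level following the child whose summary best matches the query. The tiny legal tree and the overlap scorer below are invented for illustration; real RAPTOR builds the tree by recursively clustering and summarizing chunks with a language model.

```python
TREE = {
    "summary": "law",
    "children": [
        {"summary": "contract law breach damages",
         "children": [{"summary": "liquidated damages clause enforceability",
                       "children": []}]},
        {"summary": "criminal law procedure sentencing",
         "children": [{"summary": "sentencing guidelines for fraud",
                       "children": []}]},
    ],
}

def overlap(query, text):
    """Toy relevance score: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def tree_retrieve(query, node):
    """Descend the topic tree, always following the best-matching child."""
    while node["children"]:
        node = max(node["children"], key=lambda c: overlap(query, c["summary"]))
    return node["summary"]

print(tree_retrieve("are liquidated damages clauses enforceable", TREE))
```

Each level prunes the irrelevant branches, which is why the approach scales well from broad regulatory categories down to specific case details.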

5. Diversified Expansion: REVEAL, REACT and Memo RAG

1. REVEAL: Fusion of Vision and Reasoning

REVEAL is designed specifically for visual-language tasks. It combines reasoning ability with visual information and is based on real-world visual facts, enabling the model to reduce hallucinations when dealing with problems involving images and improve the accuracy and reliability of answers. In the quality inspection process of the manufacturing industry, when building a visual compliance inspection assistant, REVEAL can conduct in-depth analysis of product design or packaging images, extract key visual features in the image, such as warning labels, product logos, etc., and retrieve relevant regulatory standards and brand specification documents to accurately judge whether the product is compliant, identify problems in a timely manner, and make rectification suggestions. In the field of education, for scenarios based on graph learning, such as graph teaching in subjects such as biology, physics, and geography, REVEAL can help intelligent tutors better understand the graphs presented by students, retrieve relevant textbook content, and provide students with detailed graph interpretations and knowledge point explanations to promote students' understanding and mastery of knowledge.

2. REACT: Synergy of thinking and action

REACT introduces a "thinking-action" loop mechanism, which enables the model to perform step-by-step reasoning when dealing with problems, and calls corresponding tools (such as search APIs, calculators, databases, etc.) based on the reasoning results to complete the task. In the field of programming, coding assistance tools can use the REACT mechanism to generate possible solution hypotheses through reasoning when encountering code debugging problems, and then call related document retrieval tools and code execution environments to verify and correct the hypotheses, and gradually solve the problems in the code. In the legal industry, when building legal assistants, REACT can help lawyers perform logical reasoning based on the specific circumstances of the case when handling cases, determine the regulations and cases that need to be retrieved, and then retrieve information by calling the legal database to analyze the contradictions in the case, and finally provide lawyers with strong support for case analysis and legal document drafting.
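
The "thought-action" loop can be sketched as alternating reasoning steps and tool calls, with each observation fed back into the next step. The tools and the rule-based "reasoner" below are toy stand-ins for an LLM and real APIs; a genuine REACT agent would let the model itself choose the next action from the trace so far.

```python
def search_docs(query):
    """Toy documentation-search tool."""
    docs = {"python list sort": "list.sort() sorts in place; sorted() returns a new list."}
    return docs.get(query, "")

def run_code(snippet):
    """Toy execution tool: evaluates only tiny arithmetic snippets."""
    allowed = set("0123456789+-*/ ()")
    return str(eval(snippet)) if set(snippet) <= allowed else "refused"

TOOLS = {"search": search_docs, "execute": run_code}

def react(task):
    """Alternate thinking and acting, recording (action, observation) pairs."""
    trace = []
    if "sort" in task:                    # thought: I need documentation
        trace.append(("search", TOOLS["search"]("python list sort")))
    if "2+2" in task:                     # thought: I need to compute
        trace.append(("execute", TOOLS["execute"]("2+2")))
    return trace

for action, observation in react("how do I sort a python list, and what is 2+2?"):
    print(action, "->", observation)
```

The key property is that observations from one action can change what the agent decides to do next, which hard-coded pipelines cannot do.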

3. Memo RAG: Memory Optimized Retrieval

Memo RAG stores and manages previously retrieved useful documents and information by building a retrieval memory cache. When encountering similar problems, the system can directly obtain relevant information from the cache, avoiding repeated retrieval of the entire corpus, thereby greatly improving retrieval efficiency and reducing response delays. In customer service scenarios, for common repetitive questions such as bill inquiries, policy consultations, etc., Memo RAG can enable chatbots to quickly extract previous answers from the memory cache, provide customers with timely and accurate services, and improve customer satisfaction. In the field of personal learning assistance, when building an AI learning coach, Memo RAG can remember the knowledge points that users have retrieved during the learning process, the difficulties they encountered, and the misunderstandings they have, and provide users with personalized learning suggestions and review materials based on this historical information, helping users learn complex knowledge systems more efficiently.
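
A minimal retrieval memory cache looks like this: questions are normalized into a key, and repeats are served from the cache instead of re-searching the corpus. The stop-word list and corpus scan are toy stand-ins for real query normalization and an expensive retrieval call.

```python
CORPUS = ["Your bill is issued on the 1st of every month.",
          "Refund requests are reviewed within 48 hours."]

cache = {}
lookups = {"cached": 0, "full": 0}

def normalize(question):
    """Collapse a question to a crude cache key (sorted content words)."""
    stop = {"what", "is", "my", "the", "a", "when", "do", "i", "get"}
    return " ".join(sorted(w for w in question.lower().split() if w not in stop))

def memo_retrieve(question):
    """Serve repeats from the cache; fall back to a full corpus scan."""
    key = normalize(question)
    if key in cache:
        lookups["cached"] += 1
        return cache[key]
    lookups["full"] += 1                  # expensive path: scan the corpus
    q = set(key.split())
    best = max(CORPUS, key=lambda d: len(q & set(d.lower().split())))
    cache[key] = best
    return best

memo_retrieve("When is my bill issued?")
memo_retrieve("When do I get my bill issued?")   # same key, served from cache
print(lookups)
```

Two differently phrased billing questions normalize to the same key, so only the first one pays the cost of a full retrieval.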

6. Overview of other special RAG types

In addition to the RAG types highlighted above, a number of other architectures each bring their own strengths:
1. Graph RAG structures the relationships between entities and concepts as a knowledge graph, letting the model reason over those relationships and making answers more logical and interpretable.
2. Duo RAG combines two generators or retrievers, using model diversity to reduce the risk of hallucination and improve reliability.
3. Context-Aware RAG remembers the user's contextual information, including past conversations, behaviors, and preferences, to deliver more personalized service.
4. Ensemble RAG runs multiple RAG pipelines and selects or merges the best output per task, balancing speed, cost, and accuracy.
5. Multimodal RAG goes beyond text, bringing images, video, and audio into the scope of retrieval for richer, more comprehensive answers.
6. Federated RAG suits scenarios with dispersed data, enabling knowledge retrieval while preserving data privacy.
7. Online RAG updates the knowledge base in real time to keep information current.
8. Modular RAG adopts a flexible plug-in architecture whose components can be swapped per task.
9. Multi-Hop RAG handles complex questions requiring multi-step reasoning, retrieving and answering sub-questions step by step to reach an accurate final answer.
10. Tool-Integrated RAG couples RAG with tool use, letting the model perform operations while generating answers.
11. Cascade RAG uses a layered retrieval architecture that progressively refines results and improves retrieval quality.
12. Asynchronous RAG supports parallel, event-driven operation across components, suiting distributed and multi-threaded applications.

7. Choose the RAG type that suits your project

In actual project development, choosing the right RAG type is a key step toward success. Developers should weigh the project's specific needs, data characteristics, and performance requirements together. For open-domain question answering with tight response-time targets and a preference for a simple, efficient architecture, standard RAG is a good starting point. When answer quality is paramount and answers must be continuously refined, corrective RAG or self-RAG is the better fit. For complex knowledge domains involving structured relationships or multi-modal data, Graph RAG or Multimodal RAG adds the most value. And for systems that must make autonomous decisions, agent-based RAG combined with REACT or Tool-Integrated RAG is an ideal choice.