A viral diagram with over 100,000 views, describing the mainstream architectures of RAG

Written by
Silas Grey
Updated on: July 17, 2025
Recommendation

Explore how the RAG architecture boosts the performance of large models and leads the new "retrieval + generation" paradigm.

Core content:
1. How RAG breaks through the limitations of traditional models and improves large-model generation
2. The definition, advantages, and mainstream architectures of RAG
3. The effectiveness and traceability advantages of RAG in practical applications

Yang Fangxian
Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)
The continuous improvement of large model performance has further unlocked the potential of RAG and moved it beyond the original "retrieve-and-paste" paradigm.


For more details, see the section "RAG definition, benefits, and common architectures" below.
This diagram has been widely circulated in overseas communities recently. It lays out the mainstream architectures of RAG in a structured way. RAG is used to improve the output of large models and make them more intelligent: as the semantic and logical reasoning capabilities of large models keep improving, they can identify and apply professional knowledge bases more accurately. This article sorts out the basics of RAG to build a clearer picture.

Table of contents:

  • Why do we need RAG?
  • RAG definition, benefits, and common architectures
  • What other ways can improve the generation results of large models?
  • RAG Practice

Why do we need RAG?





In the early days of training large models, the training set was fixed, so the models performed poorly on time-sensitive, domain-specific, or long-tail questions. For example, when asked an academic question in biomedicine, or asked to summarize the sports events that ended this year, a model would hallucinate badly or simply answer, "My knowledge is current as of month xx, xxxx, and I cannot access or retrieve data after that."

The core idea of RAG is to combine a retrieval system with a generative model, enhancing the model's generation ability by dynamically retrieving from external knowledge bases. This architecture breaks through the static knowledge limits of traditional models, extends the model's knowledge boundary, and opens up a new paradigm of "generation + retrieval" working together.

RAG definition, benefits, and common architectures





RAG (Retrieval-Augmented Generation) was first described in detail in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"[1].

Because external knowledge bases are carefully curated, the quality of their corpora is relatively high, and because they are maintained by dedicated staff, generation quality improves greatly. Examples in China include CNKI (China National Knowledge Infrastructure), the National Database of Laws and Regulations, and Wanfang Data; examples abroad include Wikipedia, arXiv, and Google Scholar.

RAG brings many advantages: the knowledge base can be updated without retraining the model; generated results can be traced back to the specific retrieved document fragments; and deployment and maintenance costs are lower than fine-tuning a large model on a corpus. We summarize its advantages as follows:

| Dimension | Large Model | Large Model + RAG |
| --- | --- | --- |
| Knowledge timeliness | Training data cannot be updated after the cutoff. | Retrieves the latest data in real time (e.g., news, research papers). |
| Field expertise | Generic models fall short in vertical domains. | Connects to industry knowledge bases (e.g., legal provisions, medical guidelines). |
| Factual accuracy | Prone to hallucinations. | Generates content constrained by retrieved results. |
| Long-tail coverage | Low-frequency knowledge is easily overlooked. | Supplements rare cases through retrieval. |
| Explainability | Black-box generation is hard to trace. | Provides retrieved sources as the basis for generation. |
| Cost | Full fine-tuning updates model parameters and consumes heavy training resources. | No retraining required; only the retriever needs optimizing if necessary. |

Common RAG architectures:

  • Naive RAG: the most basic RAG architecture. It retrieves relevant document fragments and feeds them to the generative model as context to produce a response, much like going to the library, picking a few related books, reading them, and writing a summary (see the code sketch after this list).

  • Retrieve-and-rerank: first retrieves relevant document fragments, then reranks them and selects the most relevant ones as context for the generative model. Like first finding a pile of books in the library, carefully checking which are the most relevant, and then using those to write the summary.

  • Multimodal RAG: handles multimodal data (such as text and images), converts the different input types into a unified representation through a multimodal embedding model, and then retrieves and generates. This time you look not only for books but also for pictures, videos, and other materials, turn them all into something comparable, and then pick the most relevant ones to use.

  • Graph RAG: uses a graph database to store and retrieve information, capturing the relationships between data through a graph structure to generate more relevant responses. You organize the material in a special way, like a mind map, so the most relevant information is easier to find.

  • Hybrid RAG: combines multiple retrieval and generation techniques to improve the flexibility and performance of the system. You use several channels to find information, such as the library and web search, and then combine them.

  • Agentic RAG (Router): uses an agent to route queries to different retrieval and generation modules, selecting the best processing path based on the type and requirements of the query. You have an assistant who decides where to look for information, such as the library or the Internet.

  • Agentic RAG (Multi-Agent): uses multiple agents working together, each responsible for a specific task or data source, to complete complex retrieval and generation tasks. You have a group of assistants, each with a different job, such as finding books or searching the Internet, and in the end everyone's work is combined.
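To make the first two patterns concrete, here is a minimal sketch of Naive RAG plus a rerank step. It assumes the sentence-transformers library; the model names, sample documents, and the call_llm() placeholder are illustrative assumptions, not part of the original diagram:

```python
# Minimal Naive RAG + rerank sketch: vector recall with a bi-encoder,
# reranking with a cross-encoder, then generation with retrieved context.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")                # bi-encoder for recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # cross-encoder for rerank

documents = [
    "Higress is a cloud-native API gateway built on Istio and Envoy.",
    "LoRA adapts a model by training small low-rank matrices while freezing the base weights.",
    "RAG feeds retrieved passages to a large model as extra context.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def call_llm(prompt: str) -> str:
    return "..."  # placeholder: swap in any chat-completion client

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Naive RAG step: vector recall by cosine similarity."""
    q = embedder.encode(query, normalize_embeddings=True)
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(-scores)[:top_k]]

def rerank(query: str, candidates: list[str], top_n: int = 1) -> list[str]:
    """Retrieve-and-rerank step: score (query, passage) pairs with a cross-encoder."""
    scores = reranker.predict([(query, c) for c in candidates])
    order = np.argsort(-np.asarray(scores))
    return [candidates[i] for i in order[:top_n]]

def answer(query: str) -> str:
    context = "\n".join(rerank(query, retrieve(query)))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("What is Higress?"))
```

The bi-encoder gives cheap, broad recall, while the cross-encoder is slower but scores each (query, passage) pair more precisely; that trade-off is exactly what the retrieve-and-rerank pattern exploits.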

What other ways can improve the generation results of large models?


Prompts

Prompts are the most basic and intuitive way to interact with a large language model: they are instructions or guiding information. For example, a year ago, when interacting with a large model, we would often set a role for it first: "Suppose you are a fitness coach; help me prepare a one-week fitness plan."

However, as the generation capabilities of large models improve, some of these rather mechanical instructions can be dropped, and we can interact with the model directly without the role setting. Prompt design is still critical, though; it has broadened into general "questioning skills" between us and the model, such as "help me extract the core content of this paper and explain it in language a college student can understand."
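As a concrete illustration of the two prompting styles described above, here is a minimal sketch assuming the OpenAI Python SDK against an OpenAI-compatible endpoint; the model name is illustrative only:

```python
# Role-setting vs. direct task prompting, assuming the OpenAI Python SDK and
# an OpenAI-compatible endpoint; the model name is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Style 1: set a role first, then ask.
role_play = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Suppose you are a fitness coach."},
        {"role": "user", "content": "Help me prepare a one-week fitness plan."},
    ],
)

# Style 2: skip the role and state the task and constraints directly.
direct = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Help me extract the core content of this paper "
                                    "and explain it in language a college student can understand: ..."},
    ],
)

print(role_play.choices[0].message.content)
print(direct.choices[0].message.content)
```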

Fine-tuning

Fine-tuning means continuing to train a pre-trained model (such as BERT or GPT) on data for a specific task, adjusting the model parameters so that it adapts to new tasks (such as drafting legal documents or classifying medical text). There are many fine-tuning techniques, and they keep iterating along a cost-effectiveness path (resource input vs. generation quality). The mainstream options are:

  • Full fine-tuning: deeply adapts the model parameters by retraining the large model on domain data so that it excels in a particular field, such as a legal or medical model, but the training cost is high. It suits fields with strict accuracy requirements, such as legal contract generation and medical diagnosis.
  • LoRA fine-tuning: LoRA (Low-Rank Adaptation) is a more cost-effective fine-tuning technique. Through low-rank matrix decomposition, it trains only a small number of new parameters to adapt to downstream tasks while freezing the original parameters of the pre-trained model. The training cost is low, but the result may fall short of full fine-tuning. It suits scenarios such as style adaptation in writing and image generation (a code sketch follows below).
To understand the difference: full fine-tuning is like studying the entire four-year aesthetics curriculum in college, while LoRA fine-tuning is like specializing in traditional Chinese painting. There are also methods such as Adapter and BitFit; together they mark the evolution of large-model tuning from "whole-scale renovation" to "precision surgery", pursuing a balance between efficiency and results.
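A minimal LoRA sketch, assuming the Hugging Face transformers and peft libraries; the base model name and target modules are illustrative and depend on the architecture being tuned:

```python
# Minimal LoRA setup: freeze the base model and train only small low-rank
# adapter matrices. Model name and target modules are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2-0.5B"                  # any causal LM works in principle
base = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                        # rank of the low-rank update matrices
    lora_alpha=16,                              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],        # which projection weights get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)       # original weights stay frozen
model.print_trainable_parameters()              # typically well under 1% of all parameters
# ...train as usual: only the adapter matrices receive gradient updates.
```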

Distillation

Distillation is a model compression technique: a small model (the student) learns to imitate the "knowledge" of a large, complex model (the teacher), so that the small model retains most of the performance while drastically reducing computing resource requirements.
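To make the mechanism concrete, here is a minimal sketch of the classic soft-label distillation loss in PyTorch (a standard formulation, not something specified in this article): the student matches the teacher's temperature-softened output distribution while still fitting the ground-truth labels.

```python
# Classic knowledge-distillation loss: KL divergence to the teacher's softened
# distribution plus ordinary cross-entropy on the true labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: compare softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)          # rescale to keep gradients comparable across temperatures
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```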

| Dimension | Distillation | Fine-tuning |
| --- | --- | --- |
| Goal | Compress the model while maintaining performance. | Adapt the model to new tasks. |
| Model relationship | Teacher model → student model (usually smaller). | The same model (or same architecture), self-optimized. |
| Data dependency | Needs the teacher model's outputs as supervision signals. | Uses task-labeled data directly. |
| Typical scenarios | Mobile deployment, edge computing. | Task adaptation in professional fields (e.g., legal Q&A). |
| Resource consumption | Student-model training cost is low. | Full fine-tuning is costly; low-cost options such as LoRA are available. |

Online search
In the early days, when base models' reasoning ability was weak and Internet data quality was uneven, connecting a model to web search tended to increase hallucination. As base-model reasoning improved, hallucination gradually decreased, and large models began to offer built-in web search to better handle time-sensitive and long-tail questions (a code sketch follows the comparison table below).
| Dimension | LLM | LLM + RAG | LLM + Online Search |
| --- | --- | --- | --- |
| Knowledge update | ❌ Static knowledge: relies only on training data and cannot acquire new knowledge (e.g., events after 2023). | ⚠️ Depends on the knowledge base, which must be updated regularly; freshness depends on update frequency. | ✅ Real-time updates: fetches the latest information via APIs (e.g., news, stock prices). |
| Resource consumption | ✅ Lowest: depends only on model parameters; no extra resources. | ⚠️ Medium: requires storing and retrieving a knowledge base (storage cost); retrieval consumes compute. | ❌ High: requires external API calls (cost) and real-time network requests (latency). |
| Response speed | ✅ Fastest: generates answers directly with no extra retrieval step. | ⚠️ Moderate: searching the knowledge base takes time, but local retrieval is faster than network requests. | ⚠️ Slower: must wait for network requests to return, adding latency. |
| Accuracy | ❌ Limited by training data: may hallucinate, especially on long-tail questions. | ✅ Highly controllable: grounded in a curated knowledge base; accuracy depends on its quality. | ⚠️ Depends on search results: may be affected by search-engine noise or misinformation (needs filtering). |
| Explainability | ❌ Low: no information source; black-box generation. | ✅ High: traceable to a specific document or passage in the knowledge base. | ✅ High: provides links to the sources of search results (e.g., cited web pages). |
| Applicable scenarios | General conversation and text generation that need neither real-time data nor deep expertise. | Q&A in professional fields (e.g., law, medicine), enterprise knowledge management, tasks relying on structured data. | Real-time queries (e.g., weather, stock prices), news summaries, dynamic event analysis. |
| Data dependency | ✅ None: depends only on pre-trained parameters. | ⚠️ Depends on an internal/external knowledge base that must be built and maintained. | ❌ Depends on external data sources: needs search-engine APIs (e.g., Google, Bing). |
| Implementation complexity | ✅ Simple: just call the model. | ⚠️ Higher: needs a knowledge base and retrieval optimization (e.g., vector similarity). | ⚠️ Medium: needs API integration, request handling, and result parsing. |
| Privacy and security | ✅ High: no external data exchange, though the model itself may leak training data. | ✅ Highly controllable: data privacy can be guaranteed with an internal knowledge base. | ❌ Higher risk: involves external data, may expose sensitive queries, relies on uncontrolled sources. |
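As mentioned above, here is a minimal sketch of the LLM + Online Search pattern; web_search() and call_llm() are hypothetical placeholders standing in for a real search API wrapper and a chat-completion call:

```python
# Minimal search-augmented generation sketch. Both helpers below are
# hypothetical placeholders, not a specific product's API.
def web_search(query: str, top_k: int = 3) -> list[dict]:
    # A real implementation would call a search engine API (e.g., Bing or Google).
    return [{"title": "...", "url": "...", "snippet": "..."}] * top_k

def call_llm(prompt: str) -> str:
    # Swap in any chat-completion client here.
    return "..."

def answer_with_search(query: str) -> str:
    results = web_search(query)
    sources = "\n".join(
        f"[{i + 1}] {r['title']} ({r['url']}): {r['snippet']}"
        for i, r in enumerate(results)
    )
    prompt = (
        "Answer the question using the search results below, citing them as [1], [2], ...\n"
        "If the results are irrelevant, say you don't know.\n\n"
        f"Search results:\n{sources}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```

Returning the source links alongside the answer is what gives this approach the explainability noted in the table.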
In addition to these application-level optimizations, there are many techniques that optimize the underlying large model itself, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), pure reinforcement learning (RL), mixture of experts (MoE), and sparse attention. These improve the reasoning ability of large models at the root.

RAG Practice





Here we design a cost-reduction scenario to experience RAG + gateway + vector database in practice; this setup was used in the Higress programming challenge[2].

The large model API service is priced per million input tokens at X yuan on a cache hit and Y yuan on a cache miss, where X is much lower than Y. Taking the Tongyi series as an example, X is only 40% of Y. A better cache-hit strategy therefore reduces both response latency and the cost of calling the large model API.

Higress's plugin marketplace provides an AI Proxy WASM plugin that connects to LLM providers and supports RAG capabilities. For example, you can import an oven manual as an external knowledge base, use the vector retrieval capabilities of Alibaba Cloud Lindorm, Tair, or other vector retrieval services to perform vector recall on LLM results, and finally generate the content. The plugin workflow is shown below:


Cache optimization ideas

We want a high cache hit rate; after a cache hit there is no need to send the request to the LLM API. Here are some cases:

Case 1: Single round caching

Take the following cases as examples: the second request should reuse the result returned by the first LLM request instead of calling the LLM API again (a minimal caching sketch follows these examples):

  • Case 1
    • What is Higress
    • Introducing Higress
  • Case 2
    • How is the Wasm plugin implemented?
    • Implementation principle of Wasm plug-in
  • Case 3
    • Can I dynamically modify the Wasm plugin logic of Higress without affecting the gateway traffic?
    • Does Higress' Wasm plugin logic support hot update?
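The single-round cases above boil down to semantic caching: embed each incoming query, and reuse a cached answer only when its similarity to a previous query clears a threshold. A minimal sketch, assuming the sentence-transformers library for embeddings; the threshold value and call_llm() placeholder are illustrative:

```python
# Minimal semantic cache: reuse an earlier answer when a new query is close
# enough in embedding space, otherwise fall back to the LLM API.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached answer)
THRESHOLD = 0.90                           # tuned on pairs that must NOT hit

def call_llm(query: str) -> str:
    return "..."                           # placeholder: the real LLM API request goes here

def cached_answer(query: str) -> str:
    q = embedder.encode(query, normalize_embeddings=True)
    for vec, ans in cache:
        if float(np.dot(q, vec)) >= THRESHOLD:   # cosine similarity (vectors are normalized)
            return ans                           # hit: skip the LLM API call entirely
    ans = call_llm(query)                        # miss: pay the full token price
    cache.append((q, ans))
    return ans
```

Tuning the threshold is the hard part: too low and the false hits described in the "Result accuracy requirements" section below appear; too high and the hit rate collapses.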

Case 2: Multi-round caching

Take the following case as an example: A1 of the first dialogue can be reused as A2 of the second dialogue, and A2 of the first dialogue can be reused as A1 of the second:

  • The first dialogue group
    • Q1: Can Higress replace Nginx Ingress?
    • A1: xxxxxx
    • Q2: What about Spring Cloud Gateway?
    • A2: xxxxxx
  • The second dialogue group
    • Q1: Can Higress replace Spring Cloud Gateway?
    • A1: xxxxxx
    • Q2: What about Nginx Ingress?
    • A2: xxxxxx

Result accuracy requirements

For requests that should not hit the cache, cached results must not be returned; instead, the LLM API should be called to produce the answer. Here are some of these cases:

Case 1: The two questions have low similarity

For example, in the following cases, the second request should not return the result of the first:

  • Case 1
    1. What is Higress
    2. Give me some examples of Higress users using it
  • Case 2
    1. How is the Wasm plugin implemented?
    2. How many WASM plugins does Higress have?

Case 2: Incorrect multi-round caching

Take the following case as an example: the final result of the first dialogue must not be reused as the result of the second dialogue:

  • The first dialogue group
    • Q1: Can I modify the Wasm plugin logic of Higress dynamically?
    • A1: xxxxxx
    • Q2: How to operate?
    • A2: xxxxxx
  • The second dialogue group
    • Q1: Can I modify Higress's routing configuration dynamically?
    • A1: xxxxxx
    • Q2: How to operate?
    • A2: xxxxxx

Case 3: Avoid returning irrelevant content

Still taking the Higress-based RAG scenario as an example, questions like the following should all return "Sorry, I can't answer this question":

  • May I have your name
  • What's the weather like today?