An explainer with over 100,000 views, describing the mainstream architectures of RAG

Explore how the RAG architecture revolutionizes large-model performance and leads the new paradigm of "retrieval + generation".
Core content:
1. How RAG breaks through the limitations of traditional models and improves the generation quality of large models
2. The definition, advantages, and mainstream architectures of RAG
3. The generation quality and traceability advantages of RAG in practical applications
Table of contents:
1. Why do we need RAG?
2. RAG definition, advantages, and common architectures
3. What other ways can improve the generation results of large models?
4. RAG Practice
Why do we need RAG?
In the early days, large models were trained on a fixed training set, so they performed poorly on time-sensitive, domain-specific, or long-tail questions. For example, when asked an academic question in the biomedical field, or asked to summarize the sports events that ended this year, a large model would hallucinate badly or simply answer, "My knowledge is updated up to xx month of xxxx, and I cannot access or retrieve data after that."
The core idea of RAG is to combine a retrieval system with a generative model, enhancing the model's generative ability by dynamically retrieving from external knowledge bases. This architecture breaks through the static knowledge limitations of traditional models, extends the model's knowledge boundary, and opens up a new paradigm of "retrieval + generation" collaboration.
What are RAG's definition, advantages, and common architectures?
RAG (Retrieval-Augmented Generation) was first described in detail in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" [1].
Because the external knowledge base is carefully curated, its corpus is of relatively high quality, and because it is maintained by dedicated staff, the generation quality improves greatly. Examples in China include CNKI (China National Knowledge Infrastructure), the National Laws and Regulations Database, and the Wanfang Database; examples abroad include Wikipedia, arXiv, and Google Scholar.
RAG brings many advantages: the knowledge base can be updated without retraining the model; generated results can be traced back to the specific document fragments that were retrieved; and deployment and maintenance cost less than fine-tuning a large model on a corpus.
Common architectures of RAG:
Naive RAG: The most basic RAG architecture generates responses by retrieving relevant document fragments and feeding them into the generative model as context, just like when you go to the library to find books, you first pick a few related ones, then read them and write a summary.
Retrieve-and-rerank: This architecture first retrieves relevant document fragments, then reranks them to select the most relevant ones as context for the generative model. It is like first finding a pile of books in the library, carefully checking which are most relevant, and then using those books to write the summary. (A minimal code sketch of this retrieve-then-rerank flow follows this list.)
Multimodal RAG: This architecture handles multimodal data (such as text, images, etc.), converts different types of input into a unified representation through a multimodal embedding model, and then retrieves and generates. This time you not only look for books, but also for pictures, videos and other materials, turn them all into comparable things, and then find the most relevant ones to use.
Graph RAG: This architecture uses a graph database to store and retrieve information, capturing the relationships between data through a graph structure to generate more relevant responses. You organize your data in a special way, such as a mind map, so that it is easier to find the most relevant information.
Hybrid RAG: This architecture combines multiple retrieval and generation technologies to improve the flexibility and performance of the system. You use various methods to find information, such as libraries, web searches, etc., and then use them in combination.
Agentic RAG (Router): This architecture uses agents to route queries to different retrieval and generation modules, selecting the best processing path based on the type and requirements of the query. You have an assistant who helps you decide where to find information, such as going to the library or searching the Internet.
Agentic RAG (Multi-Agent RAG): This architecture uses multiple agents working together, each responsible for a specific task or data source, to complete complex retrieval and generation tasks. It is like having a group of assistants, each in charge of a different job, such as looking for books or searching the Internet, who finally combine their work to complete the task.
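To make the first two architectures concrete, here is a minimal sketch of a Naive RAG pipeline with an optional rerank step. The embedding model, the cross-encoder reranker, the tiny in-memory corpus, and the call_llm() placeholder are illustrative assumptions, not the setup described in this article:

```python
# Minimal Naive RAG + rerank sketch. A real system would use a vector database
# and a production LLM API instead of the in-memory corpus and placeholder below.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

corpus = [
    "Higress is a cloud-native API gateway built on Istio and Envoy.",
    "Wasm plugins let you extend the gateway without restarting it.",
    "RAG retrieves external documents and feeds them to the model as context.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")                # bi-encoder for recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # cross-encoder for precision
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Step 1 (Naive RAG): cosine-similarity retrieval over the corpus."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    return [corpus[i] for i in np.argsort(-scores)[:top_k]]

def rerank(query: str, docs: list[str], keep: int = 2) -> list[str]:
    """Step 2 (Retrieve-and-rerank): re-score the candidates with a cross-encoder."""
    scores = reranker.predict([(query, d) for d in docs])
    order = np.argsort(-np.asarray(scores))
    return [docs[i] for i in order[:keep]]

def call_llm(prompt: str) -> str:
    """Placeholder for the generative model; swap in any chat-completion API."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    context = "\n".join(rerank(query, retrieve(query)))
    prompt = f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("What is Higress?"))
```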
What other ways can improve the generation results of large models?
Prompts are the most primitive and intuitive way to interact with a large language model: they are instructions or guiding information. For example, a year ago, when interacting with a large model, we would often first assign it a role: "Suppose you are a fitness coach; help me prepare a fitness menu for the week."
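As a minimal sketch, role-setting like the example above is usually passed as a system message through an OpenAI-compatible chat API; the endpoint, API key, and model name below are placeholders, not real services:

```python
# Role-setting via a system prompt, using an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")  # placeholder endpoint

response = client.chat.completions.create(
    model="some-chat-model",  # placeholder model name
    messages=[
        {"role": "system", "content": "Suppose you are a fitness coach."},
        {"role": "user", "content": "Help me prepare a fitness menu for the week."},
    ],
)
print(response.choices[0].message.content)
```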
Fine-tuning means continuing to train a pre-trained model (such as BERT or GPT) on data from a specific task, adjusting the model parameters so that it adapts to new tasks (such as legal document generation or medical text classification). There are many fine-tuning techniques, and they keep iterating along a cost-effectiveness path (resource input vs. generation quality). The mainstream fine-tuning approaches are:
Full fine-tuning: Deeply adapts the model parameters by retraining the large model on domain-specific data, making it stronger in a given field, such as a legal or medical large model. The training cost is high. It suits fields with little tolerance for error, such as legal contract generation and medical diagnosis.
LoRA fine-tuning: LoRA (Low-Rank Adaptation) is a more cost-effective fine-tuning technique. Through low-rank matrix decomposition, it trains only a small number of new parameters to adapt to downstream tasks while freezing the original parameters of the pre-trained model. The training cost is low, but the results may not match full fine-tuning. It suits scenarios such as stylistic fine-tuning for writing and image generation. (A LoRA setup sketch follows this list.)
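For illustration, here is a minimal LoRA setup sketch using Hugging Face PEFT; the GPT-2 base model and the target_modules choice are assumptions that vary by architecture, not a recommendation from this article:

```python
# Minimal LoRA fine-tuning setup sketch with Hugging Face PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; differs per model
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters train; the base model stays frozen
```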
Distillation is a model-compression technique that lets a small model (the student) imitate the "knowledge" of a large, complex model (the teacher), so that the small model keeps high performance while significantly reducing its computing-resource requirements.
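A common way to express this, shown as a minimal sketch below, is to train the student on a weighted mix of the teacher's softened output distribution (a KL term) and the ground-truth labels; the temperature and mixing weight are illustrative hyperparameters, not values from this article:

```python
# Minimal knowledge-distillation loss sketch in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 4-class task.
s = torch.randn(8, 4, requires_grad=True)
t = torch.randn(8, 4)
y = torch.randint(0, 4, (8,))
print(distillation_loss(s, t, y))
```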
RAG Practice
Here we design a cost-reduction scenario to practice RAG + gateway + vector database; it was used in the Higress programming challenge [2].
The large model API service prices input tokens at X yuan per million on a cache hit and Y yuan per million on a cache miss, where X is much lower than Y. Taking the Tongyi series as an example, X is only 40% of Y. If better cache-hit logic can be designed, it will not only reduce response latency but also lower the cost of calling the large model API.
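As a back-of-the-envelope sketch, the expected price per million input tokens falls linearly as the hit rate rises; the unit price below is arbitrary, and only the "X is 40% of Y" ratio comes from the text above:

```python
# Sketch of how the cache hit rate affects the average input-token price.
Y = 1.0        # price per million input tokens on a cache miss (arbitrary unit)
X = 0.4 * Y    # cache-hit price, 40% of the miss price as cited for the Tongyi series

def avg_price(hit_rate: float) -> float:
    """Expected price per million input tokens at a given cache hit rate."""
    return hit_rate * X + (1 - hit_rate) * Y

for rate in (0.0, 0.3, 0.6, 0.9):
    print(f"hit rate {rate:.0%}: {avg_price(rate):.2f} per million tokens")
```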
A high cache hit rate is required: after a cache hit, there is no need to send a request to the LLM API. The following are some situations where requests should hit the cache:
Case 1: Single-round caching
Take the following examples. The second request should reuse the result returned by the first LLM request instead of calling the LLM API again:
Example 1
Request 1: What is Higress?
Request 2: Introducing Higress
Example 2
Request 1: How is the Wasm plugin implemented?
Request 2: Implementation principle of the Wasm plugin
Example 3
Request 1: Can I dynamically modify the Wasm plugin logic of Higress without affecting gateway traffic?
Request 2: Does Higress's Wasm plugin logic support hot update?
Case 2: Multi-round caching
Take the following example. A1 from the first multi-round dialogue can be reused as A2 of the second multi-round dialogue, and A2 from the first can be reused as A1 of the second:
The first dialogue group
Q1: Can Higress replace Nginx Ingress?
A1: xxxxxx
Q2: What about Spring Cloud Gateway?
A2: xxxxxx
The second dialogue group
Q1: Can Higress replace Spring Cloud Gateway?
A1: xxxxxx
Q2: What about Nginx Ingress?
A2: xxxxxx
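The cases above suggest a similarity-based lookup. Before moving on to the requests that must not hit the cache, here is a minimal sketch of that lookup; the embedding model, the in-memory cache, the 0.85 threshold, and call_llm() are illustrative assumptions, while a production setup would run this as a gateway plugin backed by a vector database:

```python
# Minimal semantic-cache sketch: embed the question, look for a close enough
# cached question, and only call the LLM API on a miss.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []   # (question embedding, cached answer)
THRESHOLD = 0.85                            # illustrative; tune on real traffic

def call_llm(question: str) -> str:
    return f"[LLM answer for: {question}]"  # placeholder for the paid API call

def ask(question: str) -> str:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    # Cache hit: a previously answered question is semantically close enough.
    for vec, answer in cache:
        if float(vec @ q) >= THRESHOLD:
            return answer
    # Cache miss: pay for the LLM call, then store the result for next time.
    answer = call_llm(question)
    cache.append((q, answer))
    return answer

print(ask("What is Higress"))        # miss -> calls the LLM
print(ask("Introducing Higress"))    # should hit the cached answer above
```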
For requests that should not hit the cache, the cached result must not be returned; instead, the LLM API should be called and its result returned. The following are some of those situations:
Case 1: The two questions have low similarity
For example, in the following cases, the second request should not return the result of the first request:
Example 1
Request 1: What is Higress?
Request 2: Give me some examples of how users use Higress
Example 2
Request 1: How is the Wasm plugin implemented?
Request 2: How many Wasm plugins does Higress have?
Case 2: Incorrect multi-round caching
Take the following example. The final answer of the first multi-round dialogue cannot be reused as the answer of the second multi-round dialogue:
The first dialogue group
Q1: Can I modify the Wasm plugin logic of Higress dynamically?
A1: xxxxxx
Q2: How to operate?
A2: xxxxxx
The second dialogue group
Q1: Can I modify Higress's routing configuration dynamically?
A1: xxxxxx
Q2: How to operate?
A2: xxxxxx
Case 3: Avoid returning irrelevant content
Still taking the RAG scenario built on Higress content as an example, the following questions should all return "Sorry, I can't reply to this question", for example:
May I have your name?
What's the weather like today?
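The negative cases above imply two extra guards: a follow-up question in a multi-round dialogue must be rewritten into a standalone question before it is embedded, and out-of-domain questions must be rejected rather than answered from the cache. Here is a minimal sketch; the rewrite_with_llm() helper, the tiny Higress knowledge base, and the 0.5 threshold are all assumptions for illustration:

```python
# Sketch of two cache guards: (1) rewrite follow-up questions ("How to operate?")
# into standalone questions so dialogues with different first questions do not
# share cache entries; (2) reject questions far from the Higress knowledge base.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
knowledge_base = [
    "Higress supports hot updating of Wasm plugin logic.",
    "Higress routing configuration can be changed dynamically.",
]
kb_vecs = embedder.encode(knowledge_base, normalize_embeddings=True)
DOMAIN_THRESHOLD = 0.5  # illustrative; below this the question is treated as off-topic

def rewrite_with_llm(history: list[str], question: str) -> str:
    """Placeholder: ask an LLM to fold the dialogue history into the question,
    e.g. 'How to operate?' -> 'How to modify Higress routing configuration dynamically?'."""
    return question if not history else f"{history[-1]} {question}"

def handle(history: list[str], question: str) -> str:
    standalone = rewrite_with_llm(history, question)    # guard 1: disambiguate follow-ups
    q = embedder.encode([standalone], normalize_embeddings=True)[0]
    if float(np.max(kb_vecs @ q)) < DOMAIN_THRESHOLD:   # guard 2: off-topic questions
        return "Sorry, I can't reply to this question"
    return f"[cache lookup / LLM call for: {standalone}]"

print(handle([], "What's the weather like today?"))
print(handle(["Can I modify Higress's routing configuration dynamically?"], "How to operate?"))
```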