"Each generation is stronger than the previous one": the evolution of modern RAG architecture

Written by
Audrey Miles
Updated on: June 13, 2025
Recommendation

Master the evolution of RAG technology and unlock new heights in enterprise AI applications.

Core content:
1. Naive RAG infrastructure and its challenges
2. RAG's three major optimization directions: query optimization, semantic enhancement, and computing efficiency
3. Agentic RAG's core mechanism and paradigm breakthrough

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

Editor's note: The author of the article we bring to you today believes that the evolution of RAG technology is a systematic optimization process from simple to complex, from Naive to Agentic. Each optimization is an attempt to solve the pain points that arise when countless companies implement large language model applications.

The article first analyzes the basic architecture of Naive RAG and its core challenges, and then explores three optimization directions in depth: query dynamic optimization (including query rewriting, query expansion and other strategies), semantic understanding enhancement (focusing on the contextual retrieval method proposed by Anthropic), and computing efficiency innovation (objectively evaluating the technical boundaries of cache-enhanced generation (CAG)). Finally, it focuses on the paradigm breakthrough of Agentic RAG and explains its two core mechanisms in detail (dynamic data source routing, answer verification and correction cycle).

Author

Compiled by Yue Yang





AI systems based on RAG (Retrieval Augmented Generation) were and still are one of the most valuable applications for enterprises to leverage Large Language Models (LLMs). I remember writing my first article about RAG almost two years ago, before the term was widely adopted.

What I described back then was a RAG system implemented in its most basic form. Since then, the industry has continued to evolve, introducing various advanced technologies along the way.

In this article, we will explore the evolution of RAG - from the basic version (Naive) to Agentic. After reading this article, you will understand the challenges that were overcome at each step in the evolution of the RAG system.


01

The emergence of Naive RAG

The launch of ChatGPT at the end of 2022 made LLMs mainstream, and Naive RAG emerged at almost the same time. Retrieval-augmented generation (RAG) was introduced to address the problems that a standalone LLM faces. In short:

  • Hallucination problem.

  • Limited context window size.

  • No access to non-public data.

  • No automatic access to information that appeared after the training cutoff; updating that knowledge requires retraining the model.

The simplest implementation of RAG is as follows:

Naive RAG.

Preprocessing:

1) Split the entire text corpus of the knowledge base into chunks - each chunk represents a piece of context that can be queried. The target data can come from a variety of sources, for example Confluence documents as the main source supplemented by PDF reports.

2) Use the Embedding Model to convert each text block into a vector embedding.

3) Store all vector embeddings in a vector database, and store the text that each embedding represents, together with a pointer to it, separately.

Retrieval process:

4) Embed the user's question or query with the same embedding model that was used for the documents in the knowledge base, so that queries and stored knowledge can be matched accurately in the same embedding space.

5) Run a query against the index of the vector database using the generated query embedding. Choose how many vectors to retrieve from the vector database - this determines the amount of context you will retrieve and ultimately use to answer the query.

6) The vector database performs an approximate nearest neighbor (ANN) search against its index for the provided query embedding and returns the previously chosen number of context vectors. The search returns the most similar vectors in the given embedding (latent) space, and these returned embeddings are then mapped back to their corresponding original text chunks.

7) Pass the question together with the retrieved context chunks to the LLM via a prompt, instructing it to answer the question using only the provided context. This does not mean prompt engineering is unnecessary - you still need to make sure the answers returned by the LLM stay within the expected bounds; for example, if the retrieved context contains no relevant data, make sure no made-up answer is provided.
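
To make the seven steps above concrete, here is a minimal, self-contained sketch of the Naive RAG flow. The embed() and ask_llm() functions are stand-ins of my own (a toy hashing embedder and a stubbed LLM call), not any specific vendor API; in a real system you would swap in a proper embedding model, vector database, and LLM client.

```python
# Minimal sketch of the Naive RAG flow described above.
# embed() and ask_llm() are illustrative stand-ins, not a real vendor API.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing-based embedding; in practice, call a real embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def chunk_corpus(corpus: str, size: int = 500) -> list[str]:
    # Step 1: split the knowledge base into chunks.
    return [corpus[i:i + size] for i in range(0, len(corpus), size)]

def build_index(chunks: list[str]) -> np.ndarray:
    # Steps 2-3: embed every chunk; keep the chunk text alongside its vector.
    return np.stack([embed(c) for c in chunks])

def retrieve(query: str, index: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    # Steps 4-6: embed the query with the same model, take the top-k nearest chunks.
    scores = index @ embed(query)  # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to the LLM of your choice.")

def answer(query: str, index: np.ndarray, chunks: list[str]) -> str:
    # Step 7: pass question + retrieved context to the LLM with guardrail instructions.
    context = "\n---\n".join(retrieve(query, index, chunks))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return ask_llm(prompt)
```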


02

Dynamic components of the Naive RAG system

Even without employing any advanced techniques, there are many moving components to consider when building a production-grade RAG system.

RAG - Dynamic Components

Retrieval process:

F)  Chunking strategy  - how to chunk data for external contexts

  • Choice between small text blocks and large text blocks

  • Sliding-window or fixed-window text splitting (a minimal sliding-window sketch follows this list)

  • Whether to attach parent/linked chunks at retrieval time, or use only the retrieved chunks themselves

C)  Embedding model  - choose the model used to embed the external context into the latent space and to embed queries into that same space. Contextual embeddings are also worth considering here.

D)  Vector Database

  • Which database to choose

  • Choose a deployment location

  • What metadata should be stored with the vector embeddings? This data will be used for pre-screening before retrieval and filtering results after retrieval.

  • Index building strategy

E)  Vector Search

  • Choosing a similarity metric

  • Choose the query path: metadata filtering first or approximate nearest neighbor (ANN) search first

  • Hybrid search solution

G)  Heuristic Rules  - Business rules applied to the search process

  • Adjust weights based on the temporal relevance of documents

  • De-duplicate the retrieved context (e.g., rank by diversity)

  • Return the original source information together with the retrieved content

  • Treat source text differently based on specific conditions (such as user query intent or document type)

Generation stage:

A)  Large Language Model  - Choose the right LLM for your application

B)  Prompt Engineering  - Even though the retrieved context is injected into the prompt, you still need to design the prompt carefully - you still need to tune the system prompt (Translator's Note: including setting roles, rules, output format, etc.) to generate the expected output and to prevent jailbreak attacks.
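
As a small illustration of the chunking-strategy item (F) above, here is a minimal sliding-window chunker next to a fixed-window one; the chunk size and overlap are arbitrary example values I picked, not recommendations.

```python
def sliding_window_chunks(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks; overlap helps preserve context across boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def fixed_window_chunks(text: str, size: int = 800) -> list[str]:
    """Non-overlapping fixed-size chunks, the simplest possible strategy."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```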

After completing all the above work, we were able to build a functioning RAG system.

But the harsh reality is that such systems often fail to truly solve business problems. For a variety of reasons, the accuracy of such systems may be low.


03

Advanced techniques to improve Naive RAG systems

To continuously improve the accuracy of the Naive RAG system, we have adopted some of the more successful techniques:

  • Query Alteration  - Several techniques can be used:

    • Query rewriting: let the large language model (LLM) rewrite the original query so that it better suits the retrieval process. There are many possible rewrites, for example correcting grammatical errors or simplifying the query into a shorter, more concise statement.

    • Query expansion: let the LLM rewrite the original query multiple times to create several variations, then run the retrieval process once per variation to retrieve more potentially relevant context (a minimal sketch follows this list).

  • Reranking  - Rerank the initially retrieved documents using a more sophisticated method than the regular contextual search. This usually requires a larger model and deliberately retrieving far more documents than are actually needed during the retrieval phase. Reranking works best in combination with the query expansion mentioned above, since expansion usually returns more data than usual. The whole process is similar to what we often see in recommendation systems.

  • Fine-tuning the embedding model  - In some domains (such as medicine), a base embedding model does not retrieve data well. In that case, you need to fine-tune your own embedding model.
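
Below is a minimal sketch of the query expansion idea from the list above: the LLM produces a few variants of the query, retrieval runs once per variant, and the results are merged and de-duplicated before reranking. The llm and retrieve callables are placeholders I am assuming; plug in your own LLM client and vector search.

```python
def expand_query(query: str, llm, n_variants: int = 3) -> list[str]:
    # Ask the LLM for alternative phrasings of the original query.
    prompt = (
        f"Rewrite the following question in {n_variants} different ways, "
        f"one per line, keeping the meaning intact:\n{query}"
    )
    variants = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [query] + variants[:n_variants]

def expanded_retrieval(query: str, llm, retrieve, k_per_query: int = 5) -> list[str]:
    # Run retrieval once per variant, then merge and de-duplicate the chunks.
    seen, merged = set(), []
    for q in expand_query(query, llm):
        for chunk in retrieve(q, k=k_per_query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged  # typically handed to a reranker before generation
```

A reranker would then score the merged chunks against the original query and keep only the top few for the final prompt.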

Next, let's look at some other advanced RAG technologies and architectures.


04

Contextual Retrieval

The concept of contextual retrieval was proposed by the Anthropic team at the end of last year. Its goal is to improve the accuracy and relevance of data retrieved in AI systems based on retrieval-augmented generation (RAG).

I really like how intuitive and simple contextual retrieval is. And it genuinely delivers good results.

The following are the implementation steps of contextual retrieval:

Contextual Retrieval

Preprocessing:

1) Use the selected chunking strategy to split each document into several text chunks.

2) For each text chunk, build a prompt that contains the chunk together with the complete document.

3) Add instructions to the prompt asking the LLM to locate the chunk within the document and generate a brief context for it, then send this prompt to the chosen LLM.

4) Merge the context generated in the previous step with its corresponding original text block.

5) Feed the combined data into a TF-IDF embedder.

6) Feed the same combined data into an LLM-based embedding model as well.

7) Store the data generated in steps 5 and 6 into a database that supports efficient search.
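
To illustrate steps 2-4, here is a sketch of a per-chunk contextualization prompt in the spirit of Anthropic's approach; the exact wording is my own assumption rather than their published prompt, and llm is again a placeholder callable.

```python
def contextualize_chunk(document: str, chunk: str, llm) -> str:
    # Steps 2-3: give the LLM the whole document plus one chunk and ask for
    # a short piece of context that situates the chunk within the document.
    prompt = (
        "<document>\n" + document + "\n</document>\n\n"
        "Here is a chunk from the document above:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n\n"
        "Write a short (1-2 sentence) context that situates this chunk within "
        "the overall document, to improve retrieval of the chunk. "
        "Answer with the context only."
    )
    context = llm(prompt)
    # Step 4: prepend the generated context to the original chunk before indexing.
    return context.strip() + "\n" + chunk
```

The combined text is then what gets fed to both the TF-IDF embedder and the embedding model in steps 5 and 6.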

Retrieval phase:

8) Retrieve relevant context using the user query: use approximate nearest neighbor (ANN) search for semantic matching and the TF-IDF index for exact keyword matching.

9) Use rank fusion to merge and de-duplicate the two result sets and select the top N candidates (a small rank-fusion sketch follows these steps).

10) Rerank the results of the previous step to narrow the scope to the top K candidates.

11) Input the result of step 10 together with the user query into the LLM to generate the final answer.
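
Step 9 mentions rank fusion; a common and simple choice is reciprocal rank fusion (RRF), sketched below under the assumption that both retrievers return ordered lists of chunk IDs. The constant k = 60 is the value commonly used in the literature.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Merge several ranked result lists: each item scores 1 / (k + rank) per list."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:                 # e.g. [ann_results, tfidf_results]
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]                        # top-N candidates for the reranker (step 10)
```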

Some thoughts:

  • Step 3 sounds (and is) expensive, but by applying prompt caching, the cost can be reduced significantly.

  • Prompt caching can be used with both proprietary (closed-source) models and open-source models (see the next section).


05

The short-lived popularity of Cache Augmented Generation

At the end of 2024, a white paper briefly went viral on social media. It introduced a technique that was supposed to revolutionize RAG (really?) - Cache-Augmented Generation (CAG). We already know how conventional RAG works, so here is a brief introduction to CAG:

RAG vs. CAG

1) Precompute the LLM's KV cache over all of the external context and keep it in memory. This only needs to be done once; subsequent steps can reuse the initial cache repeatedly without recomputation.

2) Send the LLM a prompt that contains the user query plus a system prompt with instructions on how to use the cached context.

3) Return the answer generated by the LLM to the user. Afterwards, discard the content generated during this turn from the cache, keeping only the initially cached context, so that the LLM is ready for the next generation.
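
A rough sketch of this flow with Hugging Face transformers is shown below. It assumes a small causal LM, greedy decoding, and placeholder values for KNOWLEDGE_BASE_TEXT and MODEL_NAME; it also copies the precomputed cache for each query so that the originally cached context stays intact (step 3). Cache handling differs between transformers versions, so treat this as a sketch rather than a reference implementation.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "..."            # fill in a causal LM you have access to
KNOWLEDGE_BASE_TEXT = "..."   # the external context you want to pre-load

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Step 1: one forward pass over the whole external context, done only once.
ctx_ids = tok(KNOWLEDGE_BASE_TEXT, return_tensors="pt").input_ids
with torch.no_grad():
    base_cache = model(ctx_ids, use_cache=True).past_key_values

def answer_with_cag(question: str, max_new_tokens: int = 128) -> str:
    # Step 3 handled up front: work on a copy so the precomputed cache stays intact.
    past = copy.deepcopy(base_cache)
    ids = tok(question, return_tensors="pt").input_ids  # step 2: query on top of the cache
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
            if next_id.item() == tok.eos_token_id:
                break
            generated.append(next_id)
            ids = next_id
    if not generated:
        return ""
    return tok.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)
```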

CAG promises more accurate retrieval by keeping all of the context in the KV cache (rather than retrieving only part of the data for each generation). How does it hold up in reality?

  • CAG cannot address inaccuracies caused by extremely long contexts.

  • CAG has many limitations in terms of data security.

  • For large organizations, loading the entire internal knowledge base into cache is nearly impossible.

  • The cache can no longer be updated dynamically, and adding new data becomes very difficult.

In fact, ever since most LLM providers introduced prompt caching, we have been using a variant of CAG. Our approach can be described as a fusion of CAG and RAG. The implementation is as follows:

Fusion of RAG and CAG

Data preprocessing:

1) For the Cache-Augmented Generation (CAG) part, we only use data sources that rarely change. In addition to a low update frequency, we should also consider which data sources are hit most often by relevant queries. Once this is determined, we precompute the LLM's KV cache over all of the selected data and keep it in memory. This only needs to be done once; subsequent steps can run many times without recomputing the initial cache.

2) For the RAG part, vector embeddings can be precomputed if needed and stored in a compatible database for retrieval in step 4 below. Sometimes RAG only needs simpler data types, and a regular database is sufficient.

Query path:

3) Construct a prompt that includes the user query and a system prompt, clearly instructing the large language model how to use the cached context and the externally retrieved context.

4) Convert the user query into a vector embedding for semantic search against the vector database and retrieve relevant data from the context store. If semantic search is not required, query other sources instead (such as real-time databases or the Internet).

5) Integrate the external context obtained in step 4 into the final prompt to improve the quality of the answer.

6) Return the final generated answer to the user.
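
One possible way to wire this up (my own sketch, not necessarily the author's exact setup) is to pin the stable knowledge into the provider's prompt cache and inject the per-query retrieved context into the user message. The example below uses Anthropic's prompt caching via cache_control; model IDs and parameter details change over time, so check the current API documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_KNOWLEDGE = "..."   # rarely-changing, frequently-hit documents (step 1)
MODEL_ID = "..."           # fill in the model you use

def answer(user_query: str, retrieved_context: str) -> str:
    # Steps 3-6: cached static context + freshly retrieved context + user query.
    response = client.messages.create(
        model=MODEL_ID,
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": "Answer using the cached knowledge and the retrieved context. "
                     "If neither contains the answer, say so."},
            {"type": "text",
             "text": STABLE_KNOWLEDGE,
             "cache_control": {"type": "ephemeral"}},  # provider caches this prefix
        ],
        messages=[{
            "role": "user",
            "content": f"Retrieved context:\n{retrieved_context}\n\nQuestion: {user_query}",
        }],
    )
    return response.content[0].text
```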

Next, we will explore the latest technology development direction - Agentic RAG.


06

Agentic RAG

Agentic RAG adds two new core components to try to reduce inconsistent results when responding to complex user queries.

  • Data Source Routing.

  • Answer verification and correction (Reflection).

Now, let's explore how this works.

Agentic RAG

1) Analyze the user query: the original user query is passed to an LLM-based agent for analysis. In this stage:

a. The original query may be rewritten (sometimes multiple times) to produce a single query or multiple queries that are passed on to the subsequent steps.

b. The agent determines whether additional data sources are needed to answer the query. This is the first step in demonstrating its autonomous decision-making ability.

2) If additional data is needed, the retrieval step is triggered and data source routing is performed. One or more data sets can be preconfigured in the system, and the agent is given the autonomy to select the specific data source best suited to the current query. Here are a few examples:

a. Real-time user data (such as the user's current location).

b. Internal documents that may be of interest to users.

c. Public data on the Internet.

d. …

3) Once the data has been retrieved from potentially multiple sources, we rerank it just as in regular RAG. This is also a critical step, because it allows multiple data sources using different storage technologies to be integrated into the RAG system. The complexity of the retrieval process can be encapsulated behind the tools provided to the agent.

4) Attempt to generate the answer (or multiple answers, or a set of action instructions) directly with the large language model. This can happen on the first pass, or after the answer verification and correction (Reflection) stage.

5) Analyze and summarize the generated answers and evaluate their correctness and relevance :

a. If the agent determines that the answer is good enough, it returns it to the user.

b. If the agent thinks the answer needs to be improved, it tries to rewrite the user query and repeats the generation loop (a sketch of the full loop follows below). This is the second major difference between conventional RAG and Agentic RAG.
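
The loop described above can be sketched roughly as follows. The llm callable, the sources dictionary of retrieval tools, and the rerank function are placeholders I am assuming; a production agent would use structured tool calling rather than raw string prompts, but the control flow is the same.

```python
def agentic_rag(user_query: str, llm, sources: dict, rerank, max_rounds: int = 3) -> str:
    """sources maps a name (e.g. 'internal_docs', 'web') to a retrieve(query) -> list[str] function."""
    query = user_query
    for _ in range(max_rounds):
        # 1) Analyze / rewrite the query and decide whether extra data is needed.
        plan = llm(
            f"Question: {query}\n"
            f"Available sources: {', '.join(sources)} or NONE.\n"
            "Reply with the single most useful source name."
        ).strip()

        # 2-3) Route to the chosen source, retrieve, and rerank.
        context = []
        if plan in sources:
            context = rerank(query, sources[plan](query))

        # 4) Generate a candidate answer from the (possibly empty) context.
        answer = llm("Context:\n" + "\n\n".join(context) + f"\n\nAnswer the question: {query}")

        # 5) Reflection: accept the answer or rewrite the query and loop again.
        verdict = llm(
            f"Question: {user_query}\nAnswer: {answer}\n"
            "Reply OK if the answer is correct and complete, "
            "otherwise reply with a better rephrasing of the question."
        ).strip()
        if verdict.upper().startswith("OK"):
            return answer
        query = verdict  # rewritten query for the next round
    return answer  # best effort after max_rounds
```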

Anthropic's recently open-sourced MCP (Model Context Protocol) will provide strong support for the development of Agentic RAG.


07

Wrapping up

So far, we have reviewed the evolution of modern Retrieval-Augmented Generation (RAG) architecture. RAG technology is not dead, nor will it disappear in the short term. I believe its architectures will continue to evolve for some time to come. Learning these architectures and understanding when to use which solution is a worthwhile investment.

In general, the simpler the solution, the better, as adding system complexity introduces new challenges. Some emerging challenges include:

  • Difficulty in evaluating end-to-end systems.

  • Increased end-to-end latency caused by multiple calls to large language models.

  • Increase in operating costs.