Agentic RAG: An Enhanced Retrieval-Augmented Generation (RAG) System

Written by
Iris Vance
Updated on: July 17, 2025
Recommendation

Agentic RAG: an agent-driven evolution of RAG that greatly improves its ability to handle complex tasks.

Core content:
1. The differences between Agentic RAG and traditional RAG, and the advantages of the former
2. The core architecture and technical implementation of Agentic RAG
3. How Agentic RAG integrates multi-source information and optimizes its output

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)



Agentic RAG is an AI Agent-based RAG implementation.

Specifically, it incorporates an AI agent into the RAG process to coordinate its components and perform operations beyond simple retrieval and generation, overcoming the limitations of the basic RAG pipeline.

This transformation from a "tool" into an "agent" substantially improves the RAG system's processing capability.

  • 1. Basic Concepts
    • 1.1 What is Retrieval-Augmented Generation (RAG)
    • 1.2 What is an Agent
    • 1.3 What is Agentic RAG
    • 1.4 Differences between Agentic RAG and RAG
  • 2. Core Architecture
    • 2.1 Agentic RAG Architecture
    • 2.2 Implementation Idea of Agentic RAG
  • 3. Technical Implementation
    • 3.1 Planning Module
    • 3.2 Execution Module
  • 4. Limitations of Agentic RAG

In the application of generative AI, the "hallucination problem" of the large language model (LLM) has always been a core obstacle restricting its implementation.

Traditional retrieval-augmented generation (RAG) mitigates this problem by introducing external knowledge bases, but its linear process (retrieve → generate) falls short on complex tasks.

For example, when a user asks, "How do I choose an air purifier suitable for home use?", a traditional RAG can often only return scattered purchase suggestions; it cannot systematically integrate details such as "core performance indicators", "user reviews and reputation", and "cost-effectiveness across brands".

Unlike the passive response of traditional RAG, Agentic RAG is more like a "project manager" that can autonomously refine questions, dynamically plan retrieval paths, and verify outputs over multiple rounds. This transformation from "tool" to "intelligent agent" greatly improves the system's processing capability.

1. Basic Concepts 

1.1 What is Retrieval-Augmented Generation (RAG)

A simple RAG consists of a retrieval component (usually consisting of an embedding model and a vector database) and a generation component (an LLM).

At inference time, the user's query is compared for similarity against the indexed documents, and the documents most similar to the question are retrieved to provide additional context for the LLM.

RAG enhances a traditional large language model (LLM) by integrating external knowledge sources, enabling the LLM to access and use a wealth of information beyond its original training data.

Think of RAG as a scholar who, in addition to their own knowledge, has instant access to a comprehensive library.
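
In code, this linear flow takes only a few lines. Below is a minimal, illustrative sketch using LangChain with Chroma and OpenAI models (the same stack as the implementation in section 3); the sample text, model name, and API-key setup are assumptions, not part of the original article:

# Minimal naive RAG sketch: one retrieval pass, one generation pass.
# Illustrative only; the indexed text and model names are placeholders.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Retrieval component: embedding model + vector database
vectorstore = Chroma.from_texts(
    texts=["RAG retrieves external documents to ground LLM answers."],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# Generation component: an LLM prompted with the retrieved context
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

question = "What does RAG do?"
docs = retriever.invoke(question)                    # retrieve once
context = "\n\n".join(d.page_content for d in docs)  # assemble context
answer = llm.invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)

Note the two limitations discussed next are visible in this sketch: there is a single knowledge source, and the context is retrieved exactly once, with no check on its quality.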

Typical RAG applications have two significant limitations:

  1. Simple RAGs consider only one external knowledge source. However, some scenarios may require multiple external knowledge sources, while others may require external tools and APIs, such as web search.

  2. The context is retrieved only once. There is no reasoning or validation about the quality of the retrieved context.


1.2 What is an Agent

An AI Agent, or artificial intelligence agent, is generally called an intelligent agent. It is a system that can perceive its environment, make decisions, and take actions.

These systems are capable of performing reactive tasks as well as proactively finding solutions to problems, adapting to changes in the environment, and making decisions without direct human intervention.

The core components of Agent are:

  • Model

    In the context of an agent, the model refers to the language model (LLM) that acts as the central decision maker in the agent's process.

  • Memory

    Memory consists of two parts: short-term and long-term. Short-term memory relates to in-context learning and is part of prompt engineering, while long-term memory involves long-term retention and retrieval of information, usually via an external vector store with fast retrieval.

  • Planning

    Agents need planning (and decision-making) capabilities to perform complex tasks effectively. This involves sub-goal decomposition, continuous reasoning (i.e., chain of thought), self-reflection and criticism, and reflection on past actions.

  • Tools

    The various tools the agent may call, such as a calendar, calculator, code interpreter, and search function.
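
As a small illustration of the model-plus-tools pairing, the sketch below binds a toy calculator tool to an LLM via LangChain's bind_tools, letting the model itself decide whether to call it. The tool is a placeholder assumption, not part of the original article:

# Sketch: give the central LLM a tool it may choose to call.
# The calculator is a stand-in for any external capability.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression."""
    return str(eval(expression))  # demo only; never eval untrusted input

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).bind_tools([calculator])

response = llm.invoke("What is 17 * 23?")
print(response.tool_calls)  # the model's decision to call the tool, with arguments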


1.3 What is Agentic RAG

Agentic RAG is an AI Agent-based RAG implementation.

Specifically, it incorporates an AI agent into the RAG process to coordinate its components and perform operations beyond simple retrieval and generation, overcoming the limitations of the basic RAG pipeline.

A RAG agent can reason and act in retrieval scenarios such as the following:

  1. Deciding whether to retrieve information
  2. Deciding which tool to use to retrieve relevant information
  3. Refining the query itself
  4. Evaluating the retrieved context and deciding whether re-retrieval is necessary

1.4 Differences between Agentic RAG and RAG

While the basic concept of RAG (sending queries, retrieving information, and generating responses) remains the same, the use of tools makes it more general and, therefore, more flexible and powerful.

1.4.1 Core Mechanism of Traditional RAG

  • Fixed process: the user asks a question → external data is retrieved once → an answer is generated.
  • Passive response: retrieval is driven directly by the current query, producing a one-shot result.
  • Applicable scenarios: simple Q&A and factual queries (such as "How many planets are there in the solar system?").

1.4.2 Innovations of Agentic RAG

Autonomous decision-making and dynamic processes:

  • Task refinement and decomposition: refine the user's question or break a complex question into subtasks (e.g., "compare A and B" decomposes into searching A, searching B, and performing a comparative analysis).
  • Multi-step interaction: multiple retrieval-and-generation cycles gradually refine the result (e.g., first look up the definition, then the applications, and finally integrate them).
  • Dynamic adjustment: adjust the retrieval strategy based on generated content or user feedback (e.g., supplement missing information sources).

Initiative and agent characteristics:

  • Planning module: decides when to search and how to break down the problem.
  • Self-assessment: checks answer completeness and triggers a secondary search or correction (e.g., re-validation when conflicting data is found).

Interaction and feedback:

  • Multi-turn dialogue: supports context-aware continuous interaction (such as follow-up questions and clarifying requirements).
  • Leveraging user feedback: optimizes the next step based on user corrections (e.g., "I need more details").

| Feature | Traditional RAG | Agentic RAG |
| --- | --- | --- |
| Process flexibility | Fixed single retrieval and generation | Dynamic, multi-step, self-adjusting process |
| Task processing capability | Suited to simple, clear questions | Handles complex, multi-level queries |
| Interactivity | Single-round response | Supports multi-round dialogue and context tracking |
| Decision-making autonomy | Passive execution | Proactively plans, decomposes tasks, and optimizes paths |
| Feedback mechanism | None or limited | Built-in self-assessment and user-feedback integration |
| Applicable scenarios | Factual Q&A, document summaries | In-depth analysis, comparative studies, complex open-domain tasks |

2. Core Architecture 

2.1 Agentic RAG Architecture

In its most basic form, an agent is a router: it decides from which source to retrieve additional context, whether a (vector) database or another channel.

In more complex scenarios, multi-agent systems come into play. These architectures involve multiple agents working together, with each agent focusing on a specific task or data source.
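
A hedged sketch of the agent-as-router idea: one LLM call classifies the query and selects a knowledge source. The source names and routing labels here are illustrative assumptions:

# Sketch: an agent acting as a router over multiple knowledge sources.
# Source names ("products_db", "web_search") are illustrative placeholders.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def route(question: str) -> str:
    """Ask the LLM which source should serve this question."""
    decision = llm.invoke(
        "Answer with exactly one word, 'products_db' or 'web_search'.\n"
        f"Which source best answers: {question}"
    )
    return decision.content.strip()

source = route("What are the latest air purifier reviews?")
if source == "web_search":
    ...  # call a web search tool
else:
    ...  # query the product vector database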

2.2 Implementation Idea of Agentic RAG

The core architecture of the basic Agentic RAG is mainly composed of two basic modules: planning module and execution module.

Through dynamic collaboration, these modules automate the full process from task refinement to result generation.

2.2.1 Planning module: task optimization and decision making

The planning module is the "brain" of Agentic RAG, responsible for refining the user's query or breaking complex questions into executable subtasks. Its core functions include intent recognition and task refinement.

For example, when a user asks "How do I choose an air purifier suitable for home use?", the planning module first identifies the core requirement of the question (e.g., "product purchase recommendation") and then generates a series of subtasks, such as "retrieve the core performance indicators of air purifiers", "find user reviews and reputation", and "analyze the cost-effectiveness of different brands".
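
A minimal sketch of this decomposition step, using structured output so the subtasks come back as a list; the Plan schema and prompt wording are assumptions, not the article's exact implementation:

# Sketch: decompose a user question into retrievable subtasks.
# The Plan schema and prompt wording are illustrative assumptions.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class Plan(BaseModel):
    """Subtasks the execution module should retrieve, in order."""
    subtasks: list[str] = Field(description="Concrete retrieval subtasks")

planner = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Plan)

plan = planner.invoke(
    "Break this question into 3-5 retrieval subtasks: "
    "How do I choose an air purifier suitable for home use?"
)
for i, task in enumerate(plan.subtasks, 1):
    print(f"{i}. {task}")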

2.2.2 Execution module: knowledge retrieval and generation optimization

The execution module is the "hands" of Agentic RAG, responsible for completing specific retrieval and generation tasks.

Generation optimization is the execution module's key function: traditional RAG generation is one-shot, whereas Agentic RAG introduces a multi-round generation and correction mechanism.

For example, when the system detects low-credibility claims in generated content, it automatically triggers a revised search to supplement missing information or correct erroneous content.
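
A hedged sketch of such a generate-check-revise loop, with the credibility check simplified to a yes/no LLM judgment; the verifier prompt and retry limit are illustrative assumptions:

# Sketch: multi-round generation with a simple self-check.
# The verifier prompt and retry limit are illustrative assumptions.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def generate_with_check(question: str, context: str, max_rounds: int = 2) -> str:
    answer = llm.invoke(f"Context:\n{context}\n\nAnswer: {question}").content
    for _ in range(max_rounds):
        verdict = llm.invoke(
            "Is every claim in this answer supported by the context? "
            f"Reply 'yes' or 'no'.\nContext:\n{context}\nAnswer:\n{answer}"
        ).content.strip().lower()
        if verdict.startswith("yes"):
            break  # answer passed the self-check
        # Revise: regenerate, removing unsupported content
        answer = llm.invoke(
            f"Context:\n{context}\n\nRevise this answer so it only uses the "
            f"context; remove unsupported claims.\nAnswer:\n{answer}\n"
            f"Question: {question}"
        ).content
    return answer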

3. Technical Implementation 

The complete state graph, built with LangGraph:

# Imports and state definition (assumed here, following the standard LangGraph agentic-RAG setup)
from typing import Annotated, Sequence, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition

class AgentState(TypedDict):
    # Conversation state: messages accumulate via the add_messages reducer
    messages: Annotated[Sequence[BaseMessage], add_messages]

# Define a new workflow
workflow = StateGraph(AgentState)  # Use AgentState as the state type

# Define nodes in the workflow
workflow.add_node("agent", agent)  # Add the agent node
retrieve = ToolNode([retriever_tool])  # Create a retrieval tool node
workflow.add_node("retrieve", retrieve)  # Add the retrieval node
workflow.add_node("rewrite", rewrite)  # Add the question-rewrite node
workflow.add_node("generate", generate)  # Add the answer-generation node (used once documents are judged relevant)

# Set the initial edge: from START to the agent node
workflow.add_edge(START, "agent")

# Set conditional edges: decide whether to retrieve based on the agent's decision
workflow.add_conditional_edges(
    "agent",
    tools_condition,
    {
        # Map conditional outputs to nodes in the graph
        "tools": "retrieve",  # If tools are needed, go to the retrieve node
        END: END,  # Otherwise end the process
    },
)

# Set the conditional edge after retrieval
workflow.add_conditional_edges(
    "retrieve",
    # Evaluate document relevance; returns "generate" or "rewrite"
    grade_documents,
)

# Set fixed edges
workflow.add_edge("generate", END)  # End after generating the answer
workflow.add_edge("rewrite", "agent")  # Return to the agent node after rewriting the question


3.1 Planning Module

The planning module consists of the agent node (agent), query rewriting (rewrite), and document relevance grading (grade_documents).

# Imports assumed for this module
from typing import Literal

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Document relevance evaluation function
def grade_documents(state) -> Literal["generate", "rewrite"]:
    """
    Evaluate whether the retrieved documents are relevant to the question.

    Args:
        state (messages): current state

    Returns:
        str: decision on whether the documents are relevant
    """

    print("---Check document relevance---")

    # Data model for the structured output
    class grade(BaseModel):
        """Binary score for the relevance check"""
        binary_score: str = Field(description="Relevance score 'yes' or 'no'")

    # Initialize the LLM
    model = ChatOpenAI(temperature=0, model="gpt-4o-mini", base_url="https://api.openai-hk.com/v1", streaming=True)

    # Add structured-output capability to the LLM
    llm_with_tool = model.with_structured_output(grade)

    # Prompt template
    prompt = PromptTemplate(
        template="""You are a scorer that evaluates the relevance of retrieved documents to the user's question. \n
        This is the retrieved document: \n\n {context} \n\n
        This is the user's question: {question} \n
        Documents are rated relevant if they contain keywords or semantic meaning related to the user's question. \n
        Give a binary score of 'yes' or 'no' to indicate whether the document is relevant to the question.""",
        input_variables=["context", "question"],
    )

    # Build the processing chain
    chain = prompt | llm_with_tool

    # Get the question and the retrieved documents from the state
    messages = state["messages"]
    last_message = messages[-1]
    question = messages[0].content
    docs = last_message.content

    # Run the evaluation
    scored_result = chain.invoke({"question": question, "context": docs})
    score = scored_result.binary_score

    # Return a decision based on the score
    if score == "yes":
        print("---Decision: documents relevant---")
        return "generate"
    else:
        print("---Decision: documents not relevant---")
        print(score)
        return "rewrite"


### Nodes


def agent(state):
    """
    Call the agent model to generate a response based on the current state.
    Decides whether to use the retriever tool or finish directly, depending on the question.

    Args:
        state (messages): current state

    Returns:
        dict: updated state, with the agent's response appended to the messages
    """

    print("---Calling Agent---")
    messages = state["messages"]  # Get the current messages
    model = ChatOpenAI(temperature=0, streaming=True, base_url="https://api.openai-hk.com/v1", model="gpt-4o-mini")  # Initialize the LLM
    model = model.bind_tools(tools)  # Bind the tools (here, tools = [retriever_tool])
    response = model.invoke(messages)  # Call the model to generate a response
    return {"messages": [response]}  # Return the updated message list


def rewrite(state):
    """
    Transform the query to produce a better question.

    Args:
        state (messages): current state

    Returns:
        dict: updated state, including the reformulated question
    """

    print("---Transform query---")
    messages = state["messages"]  # Get the current messages
    question = messages[0].content  # Get the original question

    # Prompt asking the model for an improved question
    msg = [
        HumanMessage(
            content=f""" \n
    Look at the input and try to understand its underlying semantic intent/meaning. \n
    This is the initial question:
    \n------- \n
    {question}
    \n------- \n
    Please formulate an improved question: """,
        )
    ]

    # Initialize the LLM
    model = ChatOpenAI(temperature=0, base_url="https://api.openai-hk.com/v1", model="gpt-4o-mini", streaming=True)
    response = model.invoke(msg)  # Generate the improved question
    return {"messages": [response]}  # Return the updated message

3.2 Execution Module

The execution module consists of retrieval (the retriever tool) and generation (generate).

# Imports assumed for building the retriever
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define the list of web page URLs to be processed
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

# Load all web documents
docs = [WebBaseLoader(url).load() for url in urls]
# Flatten the nested list into a one-dimensional list
docs_list = [item for sublist in docs for item in sublist]

# Create a text splitter instance
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=50
)
# Split the documents
doc_splits = text_splitter.split_documents(docs_list)

# Initialize the embedding model
embedding_llm = OpenAIEmbeddings(
    base_url="https://api.openai-hk.com/v1"
)

# Create a vector database instance
vectorstore = Chroma.from_documents(
    documents=doc_splits,  # split documents
    collection_name="rag-chroma",  # collection name
    persist_directory="./chroma_db",  # persistent storage directory
    embedding=embedding_llm,  # embedding model to use
)

# Create a retriever instance
retriever = vectorstore.as_retriever()  # Convert the vector database into a retriever

# Import the retriever-tool factory
from langchain.tools.retriever import create_retriever_tool

# Create a retriever tool instance
retriever_tool = create_retriever_tool(
    retriever,  # retriever instance
    "retrieve_blog_posts",  # tool name
    "Search and return information about Lilian Weng's blog posts on LLM agents, prompt engineering, and adversarial attacks on LLMs.",  # tool description
)

# The tool list bound by the agent node
tools = [retriever_tool]



# Imports assumed for answer generation
from langchain import hub
from langchain_core.output_parsers import StrOutputParser

def generate(state):
    """
    Generate the answer.

    Args:
        state (messages): current state

    Returns:
        dict: updated state, containing the generated answer
    """

    print("---Generate answer---")
    messages = state["messages"]  # Get the current messages
    question = messages[0].content  # Get the question
    last_message = messages[-1]  # Get the last message
    docs = last_message.content  # Get the retrieved documents

    # Pull the RAG prompt template from the LangChain hub
    prompt = hub.pull("rlm/rag-prompt")

    # Initialize the LLM
    llm = ChatOpenAI(base_url="https://api.openai-hk.com/v1", model="gpt-4o-mini", temperature=0, streaming=True)

    # Document-formatting helper (unused here, since docs is already a string)
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)  # Merge multiple documents' contents

    # Build the RAG processing chain
    rag_chain = prompt | llm | StrOutputParser()

    # Run the RAG chain to generate the answer
    response = rag_chain.invoke({"context": docs, "question": question})
    return {"messages": [response]}  # Return the updated message containing the answer


# Inspect the RAG prompt template (pretty_print() prints it and returns None, so don't assign its result)
hub.pull("rlm/rag-prompt").pretty_print()

4. Limitations of Agentic RAG 

Although Agentic RAG has shown great potential, it still faces several limitations:

Computational latency:

Multiple rounds of retrieval and correction significantly increase computational overhead and response time.

Surging computational cost:

Multiple rounds of retrieval and evaluation significantly increase the number of API calls.

Unclear attribution of responsibility:

When autonomously generated medical advice leads to a misdiagnosis, it is unclear who bears liability.


Agentic RAG not only represents a technical iteration; it also heralds a fundamental change in the human-machine collaboration model. When a system can autonomously complete the full pipeline of demand analysis → knowledge retrieval → logical reasoning → result verification, human experts are freed from information overload and can focus on higher-level creative work.