Large models: combining multiple RAG optimization strategies (LangChain implementation)

Written by
Clara Bennett
Updated on: June 13, 2025
Recommendation

Explore cutting-edge techniques for large-model RAG optimization, combining multiple strategies to effectively mitigate hallucinations and improve the quality of retrieval and generation.

Core content:
1. Three RAG optimization strategies: Adaptive RAG routing, Corrective RAG fallback, Self-RAG self-correction
2. Comparative analysis of advanced embedding technology based on nomic-embed
3. Modular business process design and LangChain implementation


 

Large model RAG optimization: Adaptive RAG

    

This document integrates multiple RAG optimization strategies and implements them with LangChain, which effectively mitigates the hallucination problem.

Summary

We will integrate ideas from several RAG papers into the RAG agent:

  • Routing: Adaptive RAG (paper). Route questions to different retrieval methods.
  • Fallback: Corrective RAG (paper). Fall back to web search if the documents are not relevant to the query.
  • Self-correction: Self-RAG (paper). Correct answers that are hallucinated or that fail to address the question.

The specific logic diagram is as follows:

In outline, the flow is:

1. First, based on the question, decide whether to retrieve information via RAG or to search the web.

2. If the information retrieved by RAG leads to hallucinations, fall back to a web search.

3. Generate an answer from the question and the retrieved information, then check the answer against that information to determine whether it contains hallucinations.

4. Finally, produce the answer to the question.

5. If, while generating the answer, the retrieved content is found to be irrelevant, the system loops back to the web search and regenerates the answer.



Embedding Model

Nomic Embed contrastive training
We initialize the training of nomic-embed with nomic-bert-2048. Our contrastive dataset consists of about 235 million text pairs. We extensively validated its quality with Nomic Atlas during collection. You can find the details of the dataset in the nomic-ai/contrastors codebase, and you can explore a 5-million-pair subset in Nomic Atlas.
On the Massive Text Embedding Benchmark (MTEB), nomic-embed outperforms text-embedding-ada-002 and jina-embeddings-v2-base-en.

Compared with other embedding models, nomic-embed performs better:

| Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nomic-embed | 8192 | 62.39 | 85.53 | 54.16 | ✅ | ✅ | ✅ |
| jina-embeddings-v2-base-en | 8192 | 60.39 | 85.45 | 51.90 | ✅ | ❌ | ❌ |
| text-embedding-3-small | 8191 | 62.26 | 82.40 | 58.20 | ❌ | ❌ | ❌ |
| text-embedding-ada-002 | 8191 | 60.99 | 52.7 | 55.25 | ❌ | ❌ | ❌ |

However, in this project we use the embedding model that pairs with qwen-max.
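As an illustration, here is a minimal sketch of wiring a Qwen-family (DashScope) embedding model into a vectorstore and exposing it as the `retriever` used by the nodes below. The embedding model name, source URL, and chunking parameters are assumptions, not part of the original article.

# Hedged sketch: build the `retriever` used by the RAG nodes below.
# The embedding model name ("text-embedding-v2"), the source URL, and the
# chunk sizes are illustrative assumptions.
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# DashScope embeddings sit in the same model family as qwen-max
embeddings = DashScopeEmbeddings(model="text-embedding-v2")

# Load and split the knowledge-base documents (URL is a placeholder)
docs = WebBaseLoader("https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/").load()
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Index the chunks and expose a retriever for the `retrieve` node
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})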


Business Analytics

Based on the summary above, we split the workflow into modules and implement it step by step, adding RAG retrieval capability. The overall business process is as follows:

1. A conditional edge (route node) sits behind start [decides how to route the question to different retrieval methods]

2. retrieve node [returns the data retrieved from the knowledge base]

3. grade_documents node [determines whether the retrieved documents are relevant to the question; if any document is not relevant, we set a flag to run a web search]

4. web_search node [web search: looks for answers on the Internet based on the question]

5. generate node [generates an answer based on the document content]

6. A conditional edge grade_generation_v_documents_and_question [determines whether the generation is grounded in the documents and answers the question, i.e. whether hallucinations occurred] (a sketch of the shared graph state appears after this list)
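All of these nodes read and write a shared graph state. Here is a minimal sketch of that state under the assumption that the graph is built with LangGraph; the field names mirror the keys used by the node code below.

# Hedged sketch of the shared graph state (assumes the graph is built with LangGraph).
# Field names mirror the keys the node functions below read and write.
from typing import List
from typing_extensions import TypedDict
from langchain_core.documents import Document

class GraphState(TypedDict, total=False):
    question: str               # user question
    generation: str             # answer produced by the generate node
    web_search: str             # "Yes"/"No" flag set by grade_documents
    max_retries: int            # cap on regeneration attempts (the code defaults to 3)
    loop_step: int              # incremented each time generate runs
    documents: List[Document]   # retrieved documents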

1. Route module

Questions can be routed to different retrieval methods. We use an LLM agent to decide which direction to route.

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate

# Prompt
router_prompt = """You are an expert at routing a user question to a vectorstore or web search.

The vectorstore contains documents related to agents, prompt engineering, and adversarial attacks.

Use the vectorstore for questions on these topics. For all else, and especially for current events, use web-search.

Return JSON with single key, datasource, that is 'websearch' or 'vectorstore' depending on the question.
Here is the user question: \n\n {question}.
"""

router_prompt = ChatPromptTemplate.from_template(router_prompt)

# JSON-formatted (structured) output
class router_out(BaseModel):
    datasource: str = Field(description="Select 'websearch' or 'vectorstore'")
    res: str = Field(description="result")

# `model` is the chat model (e.g. qwen-max) initialized elsewhere
router_llm = router_prompt | model.with_structured_output(router_out)

1. We create a prompt:

You are an expert at routing user questions to a vectorstore or web search.
The vectorstore contains documents related to agents, prompt engineering, and adversarial attacks. [You can adapt this to your knowledge base content, or add another agent to summarize the knowledge base content.]
Use the vectorstore for questions on these topics. For everything else, especially current events, use a web search.
Return JSON with a single key, datasource, that is either "websearch" or "vectorstore", depending on the question.
This is the user's question:\n\n{question}.

2. Format the output results:

The output is a BaseModel class with two fields, datasource and res:

class router_out(BaseModel):
    datasource: str = Field(description="Select 'websearch' or 'vectorstore'")
    res: str = Field(description="result")
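For completeness, here is a hedged sketch of the conditional-edge function that consumes router_llm and decides the first hop; the function name and the printed labels are assumptions consistent with the flow described above.

# Hedged sketch of the routing conditional edge; names are illustrative.
def route_question(state):
    """Route the question to web search or to the vectorstore."""
    print("---ROUTE QUESTION---")
    decision = router_llm.invoke({"question": state["question"]})
    if decision.datasource == "websearch":
        print("---ROUTE QUESTION TO WEB SEARCH---")
        return "websearch"
    print("---ROUTE QUESTION TO RAG---")
    return "vectorstore"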

2. Retrieve

Define the node that returns documents from the knowledge base.

def retrieve(state):
    """
    Retrieve documents from vectorstore

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, documents, that contains retrieved documents
    """
    print("---RETRIEVE---")
    question = state["question"]

    # Write retrieved documents to documents key in state
    documents = retriever.invoke(question)
    return {"documents": documents}

3. generate (answer the question based on RAG)

from langchain_core.messages import HumanMessage  # message type used below

def generate(state):
    """
    Generate answer using RAG on retrieved documents
    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]
    loop_step = state.get("loop_step", 0)

    # RAG generation
    docs_txt = format_docs(documents)

    rag_prompt_formatted = rag_prompt.format(context=docs_txt, question=question)
    generation = model.invoke([HumanMessage(content=rag_prompt_formatted)])
    return {"generation": generation.content, "loop_step": loop_step + 1}

4. grade_documents

def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question
    If any document is not relevant, we will set a flag to run web search

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Filtered out irrelevant documents and updated web_search state
    """
    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]

    # Score each doc
    filtered_docs = []
    web_search = "No"

    for d in documents:
        result = grader_llm.invoke(
            {"document": d, "question": question}
        )
        grade = result.binary_score
        # Document relevant
        if grade.lower() == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        # Document not relevant
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            # We do not include the document in filtered_docs
            # We set a flag to indicate that we want to run web search
            web_search = "Yes"
            continue
    return {"documents": filtered_docs, "web_search": web_search}

1. Traverse all the documents returned by the knowledge base. If any document's content is irrelevant to the question, we supplement the context through a web search.
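grader_llm (and, later, hallucination_llm and answer_llm) is built the same way as router_llm: a prompt piped into a structured-output model. A minimal sketch, with the prompt wording as an assumption:

# Hedged sketch of the relevance grader used above; the prompt wording is illustrative.
class grade_out(BaseModel):
    binary_score: str = Field(description="Answer 'yes' if the document is relevant to the question, otherwise 'no'")

grader_prompt = ChatPromptTemplate.from_template(
    """You are a grader assessing the relevance of a retrieved document to a user question.

Here is the retrieved document:\n\n{document}\n\n
Here is the user question:\n\n{question}

If the document contains keywords or semantic meaning related to the question, grade it as relevant.
Return JSON with a single key, binary_score, that is 'yes' or 'no'."""
)

grader_llm = grader_prompt | model.with_structured_output(grade_out)

hallucination_llm and answer_llm follow the same pattern, grading whether the generation is grounded in the documents and whether it answers the question, respectively.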

5. grade_generation_v_documents_and_question

def grade_generation_v_documents_and_question(state):
    print("---CHECK HALLUCINATIONS---")
    question = state["question"]
    documents = state["documents"]
    generation = state["generation"]
    max_retries = state.get("max_retries", 3)  # Default to 3 if not provided

    result = hallucination_llm.invoke(
        {"documents": format_docs(documents), "generation": generation}
    )
    grade = result.binary_score

    # Check hallucination
    if grade == "yes":
        print("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---")
        # Check question-answering
        print("---GRADE GENERATION vs QUESTION---")
        # Test using question and generation from above
        result = answer_llm.invoke({"question": question, "generation": generation})
        grade = result.binary_score
        if grade == "yes":
            print("---DECISION: GENERATION ADDRESSES QUESTION---")
            return "useful"
        elif state["loop_step"] <= max_retries:
            print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
            return "not useful"
        else:
            print("---DECISION: MAX RETRIES REACHED---")
            return "max retries"
    elif state["loop_step"] <= max_retries:
        print("---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---")
        return "not supported"
    else:
        print("---DECISION: MAX RETRIES REACHED---")
        return "max retries"

1. Determine whether the generation is grounded in the documents.

If it is not grounded and the loop count is within the retry limit (3 by default), the edge returns "not supported" and control goes back to the generate node.

2. Determine whether the generation actually answers the question.

If it does not and the loop count is within the retry limit, the edge returns "not useful" and control goes to the websearch node.
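To tie the nodes and conditional edges together, here is a hedged sketch of how the graph might be assembled with LangGraph, reusing the GraphState and route_question sketches above. The exact edge mapping mirrors the flow described in this article, but it is an assumption rather than code from the original; the web_search node implementation is not shown in this excerpt.

# Hedged sketch of the graph wiring (assumes LangGraph); edge labels mirror the
# return values of the conditional-edge functions above.
from langgraph.graph import StateGraph, START, END

workflow = StateGraph(GraphState)

# Nodes (the web_search function is assumed to exist; it is not shown here)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("websearch", web_search)
workflow.add_node("generate", generate)

# Entry: route the question to web search or the vectorstore
workflow.add_conditional_edges(
    START,
    route_question,
    {"websearch": "websearch", "vectorstore": "retrieve"},
)

workflow.add_edge("retrieve", "grade_documents")

# If any document was irrelevant, fall back to web search; otherwise generate
def decide_to_generate(state):
    return "websearch" if state["web_search"] == "Yes" else "generate"

workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {"websearch": "websearch", "generate": "generate"},
)

workflow.add_edge("websearch", "generate")

# Self-correction: check grounding and answer quality
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "not supported": "generate",   # hallucination -> regenerate
        "not useful": "websearch",     # answer misses the question -> search again
        "useful": END,
        "max retries": END,
    },
)

graph = workflow.compile()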


Monitoring and comparison


The LangSmith monitoring view shows the path our run takes and the actual time consumed:

1. retrieve is called first, then grade_documents.

2. A web search is needed, so the flow enters web_search.

3. Output of grade_generation_v_documents_and_question:

loop_step = 1


Comparison

Question: What is cot?

Answer: Chain of Thought (CoT) is a technique for solving complex problems by simulating the human reasoning process, which breaks down the task into a series of logical steps that gradually lead to the final answer. This method improves the transparency and multi-step reasoning ability of the model and is applicable to a variety of tasks such as arithmetic reasoning, common sense reasoning, etc. However, it also has limitations such as the need for high-quality prompts and increased computational costs.

Using RAG retrieval directly returns:

Self-consistency sampling #
Self-consistency sampling (Wang et al., 2022a) samples multiple outputs with temperature > 0 and then selects the best one from these candidates.
The criteria for selecting the best candidate may vary from task to task. A general solution is majority vote. For tasks that are easy to verify, such as programming problems with unit tests, we can simply run the interpreter and verify correctness with the unit tests.
Chain-of-Thought (CoT) #
Chain-of-Thought (CoT) prompting (Wei et al., 2022) generates a sequence of short sentences that describe the reasoning logic step by step, known as a reasoning chain or rationale, eventually leading to the final answer. The benefit of CoT is more pronounced for complex reasoning tasks when using large models (e.g., with more than 50B parameters). Simple tasks benefit only slightly from CoT prompting.
Types of CoT prompts #
Two main types of CoT prompting:
Few-shot CoT: prompts the model with a few demonstrations, each containing a manually written (or model-generated) high-quality reasoning chain.

Code Highlights

ChatPromptTemplate

Usage

ChatPromptTemplate is designed for chat-style interaction scenarios and is used to build prompts in the form of multi-round dialogues. It can handle messages from different roles (such as humans and AI) and combine these messages in a certain format. This is more in line with the needs of applications such as chatbots and conversational AI. The ability of ChatPromptTemplate to handle different roles is the biggest difference from PromptTemplate.
Template structure

ChatPromptTemplate consists of multiple message templates, each corresponding to a specific role (such as HumanMessagePromptTemplate or AIMessagePromptTemplate). These message templates can contain their own placeholders to be filled with dynamic content. For example:

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate
)

# Define system message template
system_template = "You are a helpful assistant who always answers questions in simple terms."
system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

# Define user message template
human_template = "Please provide brief information about {topic}."
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

# Combine into a chat prompt template
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

# Build prompt, fill in placeholders
prompt = chat_prompt.format_prompt(topic="LLM").to_messages()

print(prompt)

Main parameters

messages: list type. Defines the structure of the chat messages. Each element in the list represents a message and is usually an instance of a BaseMessagePromptTemplate subclass, such as SystemMessagePromptTemplate or HumanMessagePromptTemplate, corresponding to message templates for different roles (system messages, human messages, and so on). The complete prompt template for a chat dialogue is constructed through the messages parameter.

Here is an example:

[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['topic'], input_types={}, partial_variables={}, template='Please provide brief information about {topic}.\n\n'), additional_kwargs={})]

input_variables: list type. The names of all placeholders in the template string.

LLM roles
The "system" role creates the context or scope of the conversation by assigning specific behaviors to the chat assistant. For example, if you want to converse with ChatGPT within the scope of sports-related topics, you can use the "system" role to set the content to "Sports Expert"; ChatGPT will then behave like a sports expert and answer your questions.
The "human" role represents the actual end user who sends questions to ChatGPT.
The "ai" role represents the entity that responds to the end user's prompt. This role indicates that the message is a response from the assistant (chat model). The "ai" role is used to set the model's previous response in the current request to maintain the coherence of the conversation.