Advanced RAG: 10 Techniques You Must Know

Written by
Caleb Hayes
Updated on: July 3, 2025
Recommendation

Master advanced RAG techniques to improve the performance of AI question-answering systems and handle complex queries and multi-turn conversations.

Core content:
1. Challenges RAG systems face with complex queries and multi-turn conversations
2. The architecture and limitations of basic RAG systems
3. 10 advanced RAG techniques that optimize indexing, retrieval, and generation

Yang Fangxian, Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)

In today's era of information explosion, AI systems have worked their way into every corner of our lives, from healthcare assistants to tutoring tools to enterprise knowledge-management bots, helping us acquire and process knowledge more efficiently. However, as application scenarios become more complex, traditional AI systems face many challenges: How do we generate truly relevant answers? How do we understand complex multi-turn conversations? How do we avoid confidently outputting wrong information? These problems are particularly prominent in systems based on RAG (Retrieval-Augmented Generation).

RAG combines the power of document retrieval with the fluency of language generation, allowing a system to give informed, context-grounded answers. However, basic RAG systems often fail to handle complex queries, multi-turn conversations, and domain-specific questions, and they can hallucinate (generate wrong information) or lose context. So how can we upgrade a RAG system to make it smarter and more reliable? Today, let's explore how to improve the performance of question-answering systems with advanced RAG techniques!

Limitations of Basic RAG

Let's first take a look at the architecture of a basic RAG system. Its workflow is roughly as follows: documents are loaded and split into small chunks using a chunking strategy; these chunks are converted into vectors with an embedding model and stored in a vector database. When a user asks a question, the system retrieves the document fragments most relevant to the question from the vector database and passes those fragments, together with the question, to a language model, which generates the final answer.
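The sketch below shows this end-to-end flow using LangChain and Chroma; the toy document, the chunk size, and the prompt are illustrative assumptions rather than recommended settings, and an OpenAI API key is assumed to be configured.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document

# 1. Load and chunk a (toy) document
document = Document(page_content="RAG combines retrieval with generation. "
                                 "It retrieves relevant chunks and feeds them to an LLM.")
chunks = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20).split_documents([document])

# 2. Embed the chunks and store them in a vector database
vectorstore = Chroma.from_documents(chunks, embedding=OpenAIEmbeddings())

# 3. Retrieve the chunks most relevant to the question
question = "What does RAG do?"
relevant_chunks = vectorstore.similarity_search(question, k=2)

# 4. Pass the question plus the retrieved context to the language model
context = "\n".join(chunk.page_content for chunk in relevant_chunks)
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)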

Sounds simple, doesn't it? But it is this very simplicity that leads to many problems with the basic RAG system:

  • Hallucination problem: the model may generate content that is irrelevant to, or even contradicts, the source documents. This is a critical flaw in fields such as medicine or law, where accuracy is paramount.
  • Lack of domain specificity: basic RAG systems often fail on complex, domain-specific topics because they retrieve irrelevant or inaccurate information.
  • Multi-turn conversation dilemma: in multi-turn conversations, a basic RAG system easily loses context, producing fragmented answers that fail to meet user needs.

So how can we break through these limitations? By introducing advanced RAG techniques that optimize and upgrade every stage of the RAG system: indexing, retrieval, and generation!

Indexing and chunking: Building a solid foundation

A good index is the core of the RAG system. We must first consider how to efficiently import, split, and store data. Next, let's look at several advanced indexing and chunking methods.

1. HNSW: A powerful tool for efficient retrieval

The HNSW (Hierarchical Navigable Small World) algorithm is a powerful tool for quickly finding similar items in large datasets. It builds a graph-based structure that supports efficient approximate nearest neighbor (ANN) search. Specifically, it has the following key features:

  • Proximity graph: HNSW builds a graph in which each point is connected to its nearby points, which makes the search process more efficient.
  • Hierarchical structure: the algorithm organizes points into layers; sparse upper layers connect distant points and dense lower layers connect close points, which speeds up the search.
  • Greedy routing: a search starts from a point in the top layer and moves greedily down through the layers until it reaches a local minimum (a point none of whose neighbors is closer to the query), greatly reducing the time needed to find similar items.

In practice, we tune HNSW through its parameters, such as M (the number of neighbors per node) and efConstruction / efSearch (how many candidate neighbors are considered while building the graph and while searching). With HNSW, we can quickly and accurately find the document fragments most relevant to the user's question within massive datasets, providing a solid foundation for answer generation.

Hands-on experience with HNSW

Next, let's implement HNSW indexing in code. Here we use FAISS, an efficient similarity-search library with a built-in HNSW index, which makes it a natural fit.

import faiss
import numpy as np

# Set HNSW parameters
d = 128  # Dimension of the vectors
M = 32   # Number of neighbors for each node

# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)

# Set efConstruction to control how many candidate neighbors are considered when building the index
efConstruction = 200
index.hnsw.efConstruction = efConstruction

# Generate random data and add it to the index
n = 10000  # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')
index.add(xb)  # Build the index

# Set efSearch to control how many candidates are explored during a search
efSearch = 100
index.hnsw.efSearch = efSearch

# Perform a search
nq = 5  # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')
k = 5   # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)

# Output the results
print("Query vectors:\n", xq)
print("\nNearest neighbor indices:\n", indices)
print("\nNearest neighbor distances:\n", distances)

Through the above code, we can see the efficiency and accuracy of HNSW in processing large-scale data sets. It can quickly find the document fragments that are most similar to the query vector and provide high-quality input for the subsequent language model generation stage.

2. Semantic Chunking: Make Information More Meaningful

Traditional chunking methods usually split text at a fixed size, which can break a complete concept or piece of information into fragments. Semantic chunking is different: it splits the text according to meaning, so that each chunk represents a coherent unit of information. Concretely, it computes the cosine distance between embeddings of consecutive sentences; if two sentences are semantically similar (distance below a threshold), they are placed in the same chunk. The advantage is more meaningful, coherent chunks and therefore more accurate retrieval. The drawback is that it requires a sentence encoder (a BERT-style model or an embedding API), so the computational cost is relatively high.

Hands-on practice of semantic chunking

Next, we implement semantic chunking in code. Here we use the SemanticChunker class from langchain_experimental, which leverages OpenAI's embedding model to split text along semantic boundaries.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize the semantic chunker
text_splitter = SemanticChunker(OpenAIEmbeddings())

# Split the document into semantically related chunks
# (`document` is assumed to be a string containing the full text to split)
docs = text_splitter.create_documents([document])
print(docs[0].page_content)

From the above code, we can see that semantic chunking can generate more meaningful chunks based on the semantic content of the text, which is very helpful for subsequent retrieval and generation steps.
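To make the threshold mechanism described above concrete, here is a rough, illustrative sketch of distance-based splitting. It is not the internal logic of SemanticChunker; the naive sentence splitting, the embedding model, and the 0.3 threshold are assumptions chosen purely for demonstration.

import numpy as np
from langchain_openai.embeddings import OpenAIEmbeddings

def semantic_split(text, threshold=0.3):
    # Naive sentence split; a real implementation would use a proper sentence tokenizer
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = OpenAIEmbeddings().embed_documents(sentences)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = np.array(embeddings[i - 1]), np.array(embeddings[i])
        # Cosine distance between consecutive sentences
        distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if distance < threshold:
            current.append(sentences[i])       # semantically close: keep in the same chunk
        else:
            chunks.append(". ".join(current))  # distance jump: start a new chunk
            current = [sentences[i]]
    chunks.append(". ".join(current))
    return chunks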

3. Language model-based chunking: Accurately capture text structure

This method uses a powerful language model (for example, a model with 7 billion parameters) to process the text, split it into complete sentences, and then combine those sentences into chunks. This preserves the integrity of each chunk while taking the surrounding context into account. Although it is computationally expensive, it can flexibly adapt the chunking to the content of the text and produce high-quality chunks, which makes it particularly suitable for applications with strict requirements on text structure.

Hands-on practice of language model-based chunking

Next, a related hands-on example: here we use OpenAI's GPT-4o model, called asynchronously, to generate a short piece of context for each chunk so that the chunk is easier to situate within the full document at retrieval time.

import asyncio
from openai import AsyncOpenAI

# Asynchronous OpenAI client (assumes OPENAI_API_KEY is set in the environment)
client = AsyncOpenAI()

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document>\n{document}\n</document>\nHere is the chunk we want to situate within the whole document\n<chunk>\n{chunk}\n</chunk>\nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        return f"{context} {chunk}"

    # Process all chunks in parallel
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks

Through the above code, we can see how a language model can enrich each chunk with context, producing higher-quality chunks that help the subsequent retrieval and generation steps.
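Note that the example above adds context to pre-existing chunks rather than choosing the chunk boundaries themselves. As a rough, hypothetical sketch of the boundary-selection idea described at the start of this section, one could simply ask the model to group the text into coherent chunks; the prompt and the "---" separator below are illustrative assumptions, not a standard API.

from openai import OpenAI

client = OpenAI()

def llm_chunk(text):
    # Ask the model to propose coherent chunks, separated by a marker line (illustrative prompt)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Split the user's text into coherent chunks of a few sentences each. "
                                          "Return the chunks separated by a line containing only '---'."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return [c.strip() for c in response.choices[0].message.content.split("---") if c.strip()]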

4. Leverage metadata: Add more context to your search

Metadata provides additional context for documents, such as dates, patient age, or prior medical history. Filtering on this metadata at search time lets us exclude irrelevant information and makes the results more accurate. For example, in the medical field, a query about children can directly filter out records of patients over 18. Storing metadata alongside the text at indexing time greatly improves the efficiency and relevance of retrieval.

Hands-on practice with metadata

Next, let's put metadata to use in code. Here we use the Document class from langchain_core, which lets us attach metadata to a document.

from langchain_core.documents import Document

# Create a document with metadata
doc = Document(
    page_content="This is a sample document.",
    metadata={"id": "doc1", "source": "https://example.com"}
)

# Print the document content and metadata
print(doc.page_content)
print(doc.metadata)

From the above code, we can see that metadata can provide more contextual information for documents, which is very helpful for subsequent retrieval and generation.
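Storing metadata is only half the story; the payoff comes from filtering on it at query time. Below is a hedged sketch using the filter argument of Chroma's similarity search (assuming the langchain_chroma integration is installed); the documents, the patient_age_group field, and the filter value are assumptions for illustration.

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(page_content="Pediatric asthma treatment guidelines.",
             metadata={"patient_age_group": "child"}),
    Document(page_content="Managing hypertension in elderly patients.",
             metadata={"patient_age_group": "adult"}),
]

vectorstore = Chroma.from_documents(docs, embedding=OpenAIEmbeddings())

# Restrict the search to documents about children
results = vectorstore.similarity_search(
    "asthma treatment", k=2, filter={"patient_age_group": "child"}
)
print([d.page_content for d in results])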

Retrieval: Accurately locating key information

Retrieval is a key part of the RAG system, which determines whether we can find documents that are truly relevant to user questions from massive data. Next, let's look at several techniques to improve retrieval performance.

5. Hybrid search: the perfect combination of semantics and keywords

Hybrid search combines vector (semantic) search with keyword search to take advantage of the strengths of both. In some fields, such as AI, many terms are specific keywords, for example algorithm names and technical jargon. Vector search alone may miss this important information, while keyword search ensures that these key terms are taken into account. By running both searches at the same time and merging and ranking the results with a weighting scheme, we get a more comprehensive and accurate list of results.

Hands-on experience with hybrid search

Next, we implement hybrid search in code. Here we use the WeaviateHybridSearchRetriever class from langchain_community, which combines the vector and keyword (BM25) search capabilities of the Weaviate vector database.

from langchain_community.retrievers import WeaviateHybridSearchRetriever

# Initialize the hybrid search retriever
# (`client` is assumed to be an already-connected weaviate.Client instance)
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="LangChain",
    text_key="text",
    attributes=[],
    create_schema_if_missing=True,
)

# Perform a hybrid search
results = retriever.invoke("the ethical implications of AI")
print(results)

From the above code, we can see that hybrid search can combine the advantages of vector search and keyword search to generate more comprehensive and accurate search results.
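If you are not running Weaviate, the same weighted-merge idea can be sketched with LangChain's EnsembleRetriever, which combines a BM25 keyword retriever with a vector retriever (the rank_bm25 package is required). The toy texts, the equal 0.5/0.5 weights, and the use of FAISS below are assumptions for illustration.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = [
    "HNSW is an approximate nearest neighbor algorithm.",
    "The ethical implications of AI are widely debated.",
    "RAG combines retrieval with text generation.",
]

# Keyword search over the texts
bm25_retriever = BM25Retriever.from_texts(texts)
bm25_retriever.k = 2

# Vector (semantic) search over the same texts
vector_retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 2})

# Merge both result lists with equal weights
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever], weights=[0.5, 0.5]
)
print(hybrid_retriever.invoke("the ethical implications of AI"))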

6. Query rewriting: making questions more “friendly”

The questions humans ask are often not in the form best suited for a database or a language model to work with. Rewriting the query with a language model can significantly improve retrieval. For example, rewriting "What are AI agents and why are they the next big thing in 2025" into "AI agents next big thing 2025" matches the database's retrieval logic much better. In addition, you can rewrite the prompt itself to improve how you interact with the language model and thus the quality and accuracy of its output.

Hands-on Query Rewriting

Next, we implement query rewriting in code. Here we use the ChatOpenAI class from langchain_openai and wrap the user's question in a short rewriting instruction.

from langchain_openai import ChatOpenAI

# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Ask the model to rewrite the question into a concise, keyword-style search query
query = "what are AI agents and why they are the next big thing in 2025"
prompt = f"Rewrite the following question as a concise search query, keeping only the key terms:\n{query}"
rewritten_query = chatgpt.invoke(prompt).content
print(rewritten_query)

From the above code, we can see that query rewriting transforms human questions into a form that databases and language models handle better, thereby improving retrieval quality.

7. Multi-query retrieval: mining information from different angles

Slightly different wording of a query can produce very different results. The Multi-Query Retriever uses a large language model (LLM) to generate multiple queries from different angles based on user input, retrieves relevant documents for each query, and aggregates the results from all queries to provide a broader set of relevant documents. This approach increases the probability of finding useful information without requiring extensive manual tuning.

Hands-on practice with multi-query retrieval

Next, we implement multi-query retrieval in code. Here we use the MultiQueryRetriever class from LangChain, which uses OpenAI's language model to generate multiple query variants and retrieves relevant documents from a Chroma vector database.

from langchain.retrievers.multi_query import MultiQueryRetriever

# Initialize the multi-query retriever
# (`similarity_retriever3` is assumed to be an existing vector-store retriever, e.g. built on Chroma,
#  and `chatgpt` is the ChatOpenAI model initialized earlier)
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever3, llm=chatgpt,
    include_original=True
)

# Perform multi-query retrieval
query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
print(docs)

From the above code, we can see that multi-query retrieval can generate queries from multiple angles and retrieve relevant documents from the vector database, thereby increasing the probability of finding useful information.

Generation: Creating high-quality answers

Finally, we come to the generation phase of the RAG system. The goal of this phase is to provide the language model with context that is as relevant to the question as possible, avoiding irrelevant information that causes "hallucinations". Here are some tips to improve the generation quality.

8. Automatic trimming: remove irrelevant information

Automatic trimming filters out retrieved results that are irrelevant to the question, preventing the language model from being misled. The idea is to find the point where the similarity scores drop sharply and exclude everything below that cutoff, ensuring that only the most relevant information is passed to the language model.

Hands-on practice with automatic trimming

Next, we implement automatic trimming in code. Here we use the PineconeVectorStore class from langchain_pinecone to run a similarity search against a Pinecone vector database and filter the results by similarity score.

from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings

# Initialize the vector store (`docs` is assumed to be a list of Documents prepared earlier)
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)

# Perform a similarity search and keep the similarity score with each document
results = vectorstore.similarity_search_with_score("dinosaur")
for doc, score in results:
    doc.metadata["score"] = score

# Trim the results: stop at the first large drop in score
# (the 0.2 gap is an illustrative threshold, not a fixed rule)
trimmed = [results[0][0]] if results else []
for (_, prev_score), (doc, score) in zip(results, results[1:]):
    if prev_score - score > 0.2:
        break
    trimmed.append(doc)
print(trimmed)

From the above code, we can see that automatic trimming filters out irrelevant results based on their similarity scores, improving the quality of the information passed to the language model.

9. Re-ranking: prioritize important information

Re-ranking uses a more powerful model (usually a cross-encoder) to re-evaluate and re-order the initially retrieved results. It scores the query and each candidate as a pair, re-determines relevance, and places the most relevant documents at the top. This way, the language model receives higher-quality context and generates more accurate responses.

Hands-on re-ranking

Next, we implement re-ranking in code. Here we use the FlashrankRerank document compressor from LangChain, which re-scores and re-orders the retrieved documents with a lightweight re-ranking model.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# Initialize the re-ranker
# (`retriever` is assumed to be an existing base retriever, e.g. from a vector store)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Perform retrieval with re-ranking
query = "What did the president say about Ketanji Brown Jackson"
compressed_docs = compression_retriever.invoke(query)
print([doc.metadata["id"] for doc in compressed_docs])
print(compressed_docs)

From the above code, we can see that re-ranking based on query-document similarity places the most relevant documents first, improving the quality of the information passed to the language model.
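To see the pairwise scoring a cross-encoder performs more directly, here is a small sketch using the sentence-transformers CrossEncoder class; the model name and the candidate passages are illustrative choices, not part of the pipeline above.

from sentence_transformers import CrossEncoder

query = "What did the president say about Ketanji Brown Jackson"
candidates = [
    "The president praised Judge Ketanji Brown Jackson's qualifications.",
    "The bill focuses on infrastructure spending.",
    "Ketanji Brown Jackson was nominated to the Supreme Court.",
]

# A cross-encoder scores each (query, passage) pair jointly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort passages by score, highest (most relevant) first
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")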

10. Fine-tune the language model: make the model understand your field better

Fine-tuning a pre-trained model on domain data can significantly improve retrieval performance. In a specific field such as medicine, you can start from a model pre-trained on relevant data (for example, the MedCPT series of retrieval models), collect your own data, and build positive and negative sample pairs for fine-tuning so that the model learns the relationships specific to your field. Fine-tuned models perform better on domain-specific retrieval and generation tasks.

Hands-on practice of fine-tuning language models

Next, let's look at where a fine-tuned model plugs into the pipeline. Note that LangChain's ChatOpenAI class does not fine-tune a model itself: an OpenAI chat model is fine-tuned through OpenAI's fine-tuning API on your own dataset, and the resulting model ID is then passed to ChatOpenAI.

from langchain_openai import ChatOpenAI

# Use the base model as-is
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# ChatOpenAI has no fine_tune() method. A chat model is fine-tuned through OpenAI's
# fine-tuning API on your own dataset (for example, medical question-answer pairs).
# Once the fine-tuning job finishes, point ChatOpenAI at the fine-tuned model ID
# (the ID below is a hypothetical placeholder):
# chatgpt = ChatOpenAI(model_name="ft:gpt-4o-mini-2024-07-18:your-org::abc123", temperature=0)

As the comments above indicate, the fine-tuned model simply replaces the base model in the pipeline; the actual gains come from training on domain data, which improves the quality of the generated answers in that field.
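Because the retrieval gains described in this section come mainly from fine-tuning the embedding model on positive and negative pairs, here is a hedged sketch using the sentence-transformers library; the base model, the toy training pairs, and the hyperparameters are all assumptions chosen for illustration.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a general-purpose embedding model (a domain model such as MedCPT could be used instead)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy domain data: each example pairs a query with a relevant passage;
# other passages in the batch act as negatives
train_examples = [
    InputExample(texts=["pediatric asthma dosage", "Inhaled corticosteroid dosing for children..."]),
    InputExample(texts=["type 2 diabetes first-line therapy", "Metformin is recommended as first-line..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other in-batch passages as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned-domain-embedder")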

Advanced RAG technology: making AI answers more reliable

Through the series of advanced RAG techniques above, we can optimize and upgrade each stage of the RAG system: indexing, retrieval, and generation, improving the overall performance of the system. Whether it is a healthcare assistant, a tutoring tool, or an enterprise knowledge-management bot, these techniques make AI systems better at handling complex information needs and help them generate more accurate, reliable, and context-aware answers.

In short, as application scenarios become more complex, AI systems need to keep evolving. Advanced RAG techniques give us an effective way to build smarter, more capable question-answering systems, making AI a true right hand for acquiring knowledge and solving problems!