In-depth analysis: Key points of implementing RAG reranking with LlamaIndex

An in-depth look at how LlamaIndex improves retrieval accuracy in a RAG pipeline through reranking.
Core content:
1. The definition of re-ranking and its role in the RAG system
2. The necessity of re-ranking and its impact on system performance
3. Classification of re-ranking methods and LlamaIndex implementation cases
1. What is reranking?
Reranking performs a finer-grained sorting of the candidate documents (usually the Top-k) returned by the initial retrieval, so that the most relevant content is placed first.
It is usually the second stage of a two-stage retrieval process:
User query → Initial retrieval (Faiss/Chroma) → Top-k candidate documents → Reranking model → Final ranking → Input to LLM
It acts as a "magnifying glass" that improves the accuracy of a RAG system, and it is a key link that separates a good system from an average one.
2. Why is reranking necessary?
Initial retrieval (such as vector-based nearest-neighbor search) is fast, but its matching ability is relatively coarse, so it may:
Rank superficially similar but irrelevant documents at the top
Miss documents that are truly highly relevant but not "close" on the surface (in embedding space)
The goals of reranking are:
Understand the semantic match between the user query and each document at a fine-grained level
Eliminate irrelevant or weakly relevant documents
Improve the accuracy, contextual quality, and relevance of the final generated content
3. Classification of reranking methods
1. Traditional methods (weak matching)
Rule-based (e.g. keyword matching, positional preference)
BM25 reranking (the same signal as lexical retrieval, only re-scoring and re-sorting the candidates; a minimal sketch follows)
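As an illustration of BM25-based re-scoring, here is a minimal sketch that reranks a handful of candidate documents. It assumes the third-party rank_bm25 package and made-up example texts, neither of which appears in the original setup.
from rank_bm25 import BM25Okapi  # assumed third-party package, not part of LlamaIndex

query = "maximum probation period of a labor contract"
candidates = [
    "The probation period may not exceed six months.",
    "Annual leave is determined by years of service.",
    "A labor contract shall be concluded in written form.",
]

# Build a BM25 index over the candidate set only (whitespace tokenization for simplicity)
tokenized_docs = [doc.lower().split() for doc in candidates]
bm25 = BM25Okapi(tokenized_docs)

# Re-score and re-sort the candidates by BM25 score
scores = bm25.get_scores(query.lower().split())
for doc, score in sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.4f}  {doc}")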
2. Semantic methods (strong matching)
Dual encoder:
Encodes the query and the document separately and then computes their similarity (fast, but relatively coarse)
Cross encoder:
Takes the query and document together as a paired input, feeds them into a BERT/Transformer model, and outputs a relevance score.
More accurate (because the model sees the full query-document interaction), but slower to compute, so it is only suitable for scoring a small number of candidate documents (see the sketch below).
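For concreteness, here is a minimal cross-encoder scoring sketch using the sentence-transformers CrossEncoder class (an assumed dependency; the candidate texts are made up). Each (query, document) pair is scored jointly by one forward pass, and the candidates are sorted by that score.
from sentence_transformers import CrossEncoder  # assumed dependency

query = "How long is the maximum probation period of a labor contract?"
candidates = [
    "The probation period may not exceed six months.",
    "Annual leave is determined by years of service.",
]

# Each (query, document) pair is scored jointly by the Transformer
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in candidates])

# Sort candidates by the cross-encoder score, highest first
for doc, score in sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.4f}  {doc}")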
4. Commonly used reranking models
cross-encoder/ms-marco-MiniLM-L-6-v2 (huggingface)
BAAI/bge-reranker-large (from BAAI; strong in both Chinese and English, with multilingual variants available)
ColBERT (an efficient late-interaction model)
5. Implementing reranking in LlamaIndex
Prerequisite: the Chroma vector database already contains a large amount of labor-law data that can be queried directly.
1. Initialize the reranking model
from llama_index.core.postprocessor import SentenceTransformerRerank

# Initialize the reranker
reranker = SentenceTransformerRerank(
    model=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-reranker-v2-m3",
    top_n=3
)
2. Build a query engine
question="How long is the maximum probation period of a labor contract?"# Create a query enginequery_engine = index.as_query_engine( similarity_top_k=10, # Initial vector recall number text_qa_template=response_template, node_postprocessors=[reranker] # Reranking stage)# Execute queryresponse = query_engine.query(question)# Display resultsprint(f"\nSmart assistant answer:\n{response.response}")print("\nSupport basis:")for idx, node in enumerate(response.source_nodes, 1): meta = node.metadata print(f"\n[{idx}] {meta['full_title']}") print(f" Source file: {meta['source_file']}") print(f" Legal name: {meta['law_name']}") print(f" Clause content: {node.text[:100]}...") print(f"Relevance score: {node.score:.4f}")
3. Complete sample code
import json
import time
from pathlib import Path
from typing import List, Dict

import chromadb
from llama_index.core import VectorStoreIndex, StorageContext, Settings, get_response_synthesizer, PromptTemplate
from llama_index.core.schema import TextNode
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.postprocessor import SentenceTransformerRerank

# ================== Configuration Area ==================
QA_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a professional assistant in the field of Chinese labor law and must strictly abide by the following rules:\n"
    "1. Only use the provided legal provisions to answer questions\n"
    "2. If the question is not related to labor law or is beyond the scope of the knowledge base, clearly state that it cannot be answered\n"
    "3. Indicate the source when quoting provisions\n\n"
    "Available legal provisions (a total of {context_count}):\n{context_str}\n<|im_end|>\n"
    "<|im_start|>user\nQuestion: {query_str}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response_template = PromptTemplate(QA_TEMPLATE)

class Config:
    RERANK_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-reranker-v2-m3"  # Newly added reranking model path
    EMBED_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3"
    LLM_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\Qwen\Qwen2___5-3B-Instruct"
    DATA_DIR = r"D:\Test\LLMTrain\day23\data"
    VECTOR_DB_DIR = r"D:\Test\LLMTrain\day23\chroma_db"
    PERSIST_DIR = r"D:\Test\LLMTrain\day23\storage"
    COLLECTION_NAME = "chinese_labor_laws"
    TOP_K = 10         # Expanded initial recall count
    RERANK_TOP_K = 3   # Number of nodes kept after reranking

# Embedding model
embed_model = HuggingFaceEmbedding(
    model_name=Config.EMBED_MODEL_PATH,
)

# LLM
llm = HuggingFaceLLM(
    model_name=Config.LLM_MODEL_PATH,
    tokenizer_name=Config.LLM_MODEL_PATH,
    model_kwargs={"trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},
    generate_kwargs={"temperature": 0.3}
)

# Initialize the reranker (new)
reranker = SentenceTransformerRerank(
    model=Config.RERANK_MODEL_PATH,
    top_n=Config.RERANK_TOP_K
)

Settings.embed_model = embed_model
Settings.llm = llm

chroma_client = chromadb.PersistentClient(path=Config.VECTOR_DB_DIR)
chroma_collection = chroma_client.get_or_create_collection(
    name=Config.COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

print("Loading index from chromadb...")
index = VectorStoreIndex.from_vector_store(ChromaVectorStore(chroma_collection=chroma_collection))

question = "What is the maximum probation period of a labor contract?"

# Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=Config.TOP_K,        # Initial vector recall count
    text_qa_template=response_template,
    node_postprocessors=[reranker]        # Reranking phase
)

# Execute the query
response = query_engine.query(question)

# Display results
print(f"\nSmart assistant answer:\n{response.response}")
print("\nSupport basis:")
for idx, node in enumerate(response.source_nodes, 1):
    meta = node.metadata
    print(f"\n[{idx}] {meta['full_title']}")
    print(f"  Source file: {meta['source_file']}")
    print(f"  Legal name: {meta['law_name']}")
    print(f"  Clause content: {node.text[:100]}...")
    print(f"  Relevance score: {node.score:.4f}")
Note:
In other RAG frameworks (such as LangChain and Haystack) you see explicit retriever objects, but in LlamaIndex the retriever is "embedded" in the QueryEngine, so there is usually no need to create one manually.
For example, the following code:
query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    llm=OpenAI(model="gpt-3.5-turbo")
)
This call actually does three things:
(1) A Retriever is constructed internally:
similarity_top_k=15 specifies the number of documents to be initially recalled
By default, VectorIndexRetriever is used (which uses vectors internally for ANN retrieval)
(2) The recalled results are passed to the re-ranker (node postprocessor)
The SentenceTransformerRerank you passed in reranks the candidate nodes
(3) Finally, the reranked documents are sent to the LLM to answer the question
Of course, you can also use the retriever explicitly, as in the following code:
question="How long is the maximum probation period of a labor contract?"# 0-Create a retriever retriever = index.as_retriever( similarity_top_k=Config.TOP_K # Expand the number of initial searches)# 0-Create a response synthesizer response_synthesizer = get_response_synthesizer( text_qa_template=response_template, verbose=True)# 1. Initial retrieval initial_nodes = retriever.retrieve(question)# Save the initial score to metadata for node in initial_nodes: node.node.metadata['initial_score'] = node.score # 2. Reranking reranked_nodes = reranker.postprocess_nodes( initial_nodes, query_str=question)# 3. Synthesize the answer response = response_synthesizer.synthesize( question, nodes=reranked_nodes)# Display results (modify display logic)print(f"\nSmart assistant answer:\n{response.response}")print("\nSupport basis:")for idx, node in enumerate(reranked_nodes, 1): # Score acquisition method compatible with the new version of the API initial_score = node.metadata.get('initial_score', node.score) # Get the initial score rerank_score = node.score # Score after reranking meta = node.node.metadata print(f"\n[{idx}] {meta['full_title']}") print(f" Source file: {meta['source_file']}") print(f" Legal name: {meta['law_name']}") print(f" Initial relevance: {node.node.metadata['initial_score']:.4f}") print(f" Reranking score: {node.score:.4f}") print(f" Clause content: {node.node.text[:100]}...")
6. Filter out low-quality documents
After applying the re-ranker in LlamaIndex, if you want to further filter out low-quality documents by score (e.g. score < 0.5), you need to write a custom NodePostprocessor, because the built-in SentenceTransformerRerank does not filter by score; it only reranks and keeps the top_n nodes.
1. Create a filter module
File: score_threshold_filter.py
from typing import List, Optional

from pydantic import Field
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor


class ScoreThresholdFilter(BaseNodePostprocessor):
    """
    Custom node postprocessor: filter nodes by a score threshold.
    If every node scores below the threshold, an empty list is returned
    so that the caller can report that no relevant data was found.

    Parameters:
        threshold (float): The score threshold for filtering. Defaults to 0.5.
        verbose (bool): Whether to print detailed filtering information. Defaults to False.
    """

    threshold: float = Field(default=0.5, description="Filter score threshold")
    verbose: bool = Field(default=False, description="Whether to print detailed filtering information")

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        if self.verbose:
            print(f"[ScoreThresholdFilter] Original number of nodes: {len(nodes)}")
            for idx, node in enumerate(nodes):
                score_display = f"{node.score:.4f}" if node.score is not None else "None"
                print(f"Node {idx + 1}: score = {score_display}")

        # Filter out nodes whose score is below the threshold
        filtered_nodes = [
            node for node in nodes
            if node.score is None or node.score >= self.threshold
        ]

        # If no node passes the filter, return the empty list
        if len(filtered_nodes) < 1:
            if self.verbose:
                print(f"[ScoreThresholdFilter] All node scores are below the threshold {self.threshold}")
            return filtered_nodes

        if self.verbose:
            print(f"[ScoreThresholdFilter] Number of nodes after filtering: {len(filtered_nodes)}, threshold: {self.threshold}")
            for idx, node in enumerate(filtered_nodes):
                score_display = f"{node.score:.4f}" if node.score is not None else "None"
                print(f"Retained node {idx + 1}: score = {score_display}")

        return filtered_nodes
Please note the following:
BaseNodePostprocessor is the base class provided by LlamaIndex for customizing node postprocessing logic.
Pydantic's Field is used to declare class fields and provide default values and descriptions.
The _postprocess_nodes method is overridden; it is the abstract method that BaseNodePostprocessor requires for processing the node list. A standalone usage sketch follows these notes.
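If you want to sanity-check the filter on its own, outside of any query engine, you can call the public postprocess_nodes method directly. A minimal sketch, with made-up node texts and scores purely for illustration:
from llama_index.core.schema import NodeWithScore, TextNode
from score_threshold_filter import ScoreThresholdFilter

# Hypothetical nodes with hand-assigned scores, just for illustration
nodes = [
    NodeWithScore(node=TextNode(text="The probation period may not exceed six months."), score=0.92),
    NodeWithScore(node=TextNode(text="Annual leave is determined by years of service."), score=0.31),
]

score_filter = ScoreThresholdFilter(threshold=0.6, verbose=True)
kept = score_filter.postprocess_nodes(nodes)   # only the 0.92 node survives
print([n.score for n in kept])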
2. Use the filter
Adjust the query-engine construction in the main retrieval code as follows:
from score_threshold_filter import ScoreThresholdFilter

# Import the custom filter; scores below 0.6 will be removed
score_filter = ScoreThresholdFilter(threshold=0.6, verbose=True)

# Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=Config.TOP_K,                  # Initial vector recall count
    text_qa_template=response_template,
    node_postprocessors=[reranker, score_filter]    # Reranking + low-score filtering
)
Effect description:
The first postprocessor (the re-ranker) sorts the recalled documents and attaches its scores
The second postprocessor (the filter) removes documents whose score is too low
Finally, the remaining documents are assembled into the prompt and fed to the LLM
The score used for filtering is the one produced by the re-ranker, not the vector similarity or embedding distance
3. Complete sample code
import json
import time
from pathlib import Path
from typing import List, Dict

import chromadb
from llama_index.core import VectorStoreIndex, StorageContext, Settings, get_response_synthesizer, PromptTemplate
from llama_index.core.schema import TextNode
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.postprocessor import SentenceTransformerRerank
from score_threshold_filter import ScoreThresholdFilter

# ================== Configuration Area ==================
QA_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a professional assistant in the field of Chinese labor law and must strictly follow the following rules:\n"
    "1. Only answer questions using the legal text provided\n"
    "2. If the question is not related to labor law or is beyond the scope of the knowledge base, clearly inform that it cannot be answered\n"
    "3. Indicate the source when quoting articles\n\n"
    "Available legal provisions ({context_count} in total):\n{context_str}\n<|im_end|>\n"
    "<|im_start|>user\nQuestion: {query_str}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response_template = PromptTemplate(QA_TEMPLATE)

class Config:
    RERANK_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-reranker-v2-m3"  # Newly added reranking model path
    EMBED_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3"
    LLM_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\Qwen\Qwen2___5-3B-Instruct"
    DATA_DIR = r"D:\Test\LLMTrain\day23\data"
    VECTOR_DB_DIR = r"D:\Test\LLMTrain\day23\chroma_db"
    PERSIST_DIR = r"D:\Test\LLMTrain\day23\storage"
    COLLECTION_NAME = "chinese_labor_laws"
    TOP_K = 10         # Expanded initial recall count
    RERANK_TOP_K = 3   # Number of nodes kept after reranking

# Embedding model
embed_model = HuggingFaceEmbedding(
    model_name=Config.EMBED_MODEL_PATH,
)

# LLM
llm = HuggingFaceLLM(
    model_name=Config.LLM_MODEL_PATH,
    tokenizer_name=Config.LLM_MODEL_PATH,
    model_kwargs={"trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},
    generate_kwargs={"temperature": 0.3}
)

# Initialize the reranker (new)
reranker = SentenceTransformerRerank(
    model=Config.RERANK_MODEL_PATH,
    top_n=Config.RERANK_TOP_K
)

Settings.embed_model = embed_model
Settings.llm = llm

chroma_client = chromadb.PersistentClient(path=Config.VECTOR_DB_DIR)
chroma_collection = chroma_client.get_or_create_collection(
    name=Config.COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

print("Loading index from chromadb...")
index = VectorStoreIndex.from_vector_store(ChromaVectorStore(chroma_collection=chroma_collection))

# question = "What is the maximum probation period of a labor contract?"
question = "What is xtuner?"

# Custom filter: scores below 0.6 will be removed
score_filter = ScoreThresholdFilter(threshold=0.6, verbose=True)

# Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=Config.TOP_K,                  # Initial vector recall count
    text_qa_template=response_template,
    node_postprocessors=[reranker, score_filter],   # Reranking + low-score filtering
)

# Execute the query
response = query_engine.query(question)

if not response.response or response.response.strip() == "" or response.response.strip().lower() == "empty response":
    print("No relevant data found")
else:
    # Display results
    print(f"\nSmart assistant answer:\n{response.response}")
    print("\nSupport basis:")
    for idx, node in enumerate(response.source_nodes, 1):
        meta = node.metadata
        print(f"\n[{idx}] {meta['full_title']}")
        print(f"  Source file: {meta['source_file']}")
        print(f"  Law name: {meta['law_name']}")
        print(f"  Clause content: {node.text[:100]}...")
        print(f"  Relevance score: {node.score:.4f}")
With the unrelated question above ("What is xtuner?"), every reranked score falls below the threshold, the filter removes all nodes, and the program prints "No relevant data found".
7. RAG Process in LlamaIndex
The process is as follows:
1️⃣ User inputs the query
    ↓
2️⃣ Retriever recalls related documents (vector / BM25 / multimodal)
    ↓
3️⃣ NodePostprocessor processes the recalled results
    - semantic reranking
    - deduplication, filtering
    - splitting, merging
    ↓
4️⃣ ResponseSynthesizer assembles the context and builds the prompt
    - response_mode: compact / refine / tree_summarize
    ↓
5️⃣ LLM generates the answer (based on prompt + documents)
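The same five steps can also be wired up explicitly instead of through index.as_query_engine. The following sketch uses RetrieverQueryEngine and reuses the index, Config, reranker, score_filter, and response_template objects from the earlier examples; treat it as an illustration of the flow under those assumptions rather than the only way to assemble the pipeline.
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# 2. Retriever: initial recall from the vector store (reuses `index` and `Config` from above)
retriever = index.as_retriever(similarity_top_k=Config.TOP_K)

# 4. Response synthesizer: builds the prompt from the surviving nodes (default response_mode is compact)
response_synthesizer = get_response_synthesizer(text_qa_template=response_template)

# 3. Node postprocessors run between retrieval and synthesis: rerank, then drop low-scoring nodes
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[reranker, score_filter],
)

# 1. + 5. Take the user query and generate the answer
response = query_engine.query("How long is the maximum probation period of a labor contract?")
print(response.response)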