In-depth analysis: Key points of implementing RAG reranking with LlamaIndex

Written by
Jasper Cole
Updated on: June 28, 2025

An in-depth analysis of how LlamaIndex improves document retrieval accuracy in a RAG system through reranking.

Core content:
1. The definition of re-ranking and its role in the RAG system
2. The necessity of re-ranking and its impact on system performance
3. Classification of re-ranking methods and LlamaIndex implementation cases


1. What is reranking?

Reranking performs a finer-grained sorting of the batch of candidate documents (usually the Top-k) returned by the initial retrieval, so that the most relevant content is placed at the front.

This is usually the second stage of a two-stage retrieval process:

User query → Initial retrieval (Faiss/Chroma) → Top-k candidate documents → Reranking model → Final ranking → Input to LLM
“Reranking means that after the model has initially found possible answers, it then carefully determines which ones are most relevant.”

It acts as a "magnifying glass" that improves the accuracy of a RAG system, and it is a key factor that separates a good system from an average one.


2. Why is reranking necessary?

Although initial retrieval (such as vector-based nearest-neighbor search) is fast, its matching is coarse, so it may:

  • Rank similar-looking but unrelated documents first

  • Miss documents that are truly highly relevant but not obviously "close" on the surface


The goals of reranking are to:

  • Understand the semantic match between the user query and each document at a fine-grained level

  • Eliminate irrelevant or weakly relevant documents

  • Improve the accuracy, contextual fit, and relevance of the final generated content


3. Classification of reranking methods

1. Traditional methods (weak matching)

  • Rule-based (e.g. keyword matching, position preference)

  • BM25 reranking (the same scoring as the initial retrieval, just re-scoring and re-sorting the candidates; see the sketch below)
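
For concreteness, here is a minimal sketch of BM25 re-scoring using the third-party rank_bm25 package (not part of this article's setup); the query and candidate texts are invented for illustration:

from rank_bm25 import BM25Okapi

# Hypothetical Top-k candidates returned by the initial retrieval
candidates = [
    "The probation period of a labor contract may not exceed six months.",
    "Annual leave entitlement depends on cumulative years of service.",
]
query = "maximum probation period of a labor contract"

# Re-score every candidate against the query with BM25, then sort descending
bm25 = BM25Okapi([doc.lower().split() for doc in candidates])
scores = bm25.get_scores(query.lower().split())
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")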


2. Semantic methods (strong matching)

Dual encoder (bi-encoder):

  • Encodes the query and each document separately, then computes their vector similarity (fast, but relatively coarse)

Cross encoder:

  • Takes the query and a document as a pair of inputs, feeds them into a BERT/Transformer model, and outputs a relevance score

  • More precise (because the model sees the full query-document interaction), but slower to compute, so it is suitable for only a small number of candidate documents (see the sketch below)
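
The following is a minimal sketch contrasting the two approaches, assuming the sentence-transformers package can load the named models (the query and documents are invented for illustration):

from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How long is the maximum probation period of a labor contract?"
docs = [
    "The probation period of a labor contract may not exceed six months.",
    "Annual leave entitlement depends on cumulative years of service.",
]

# Dual encoder: encode the query and documents independently, then compare the vectors
bi_encoder = SentenceTransformer("BAAI/bge-m3")
bi_scores = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(docs))

# Cross encoder: score each (query, document) pair jointly in a single forward pass
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross_encoder.predict([(query, doc) for doc in docs])

print("Dual-encoder similarities:", bi_scores.tolist())
print("Cross-encoder scores:", cross_scores.tolist())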


4. Commonly used reranking models

  • cross-encoder/ms-marco-MiniLM-L-6-v2 (Hugging Face)

  • BAAI/bge-reranker-large (released by BAAI; strong in both Chinese and English, with multilingual variants available)

  • ColBERT (an efficient late-interaction reranking model)
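
Any of these models can be passed to LlamaIndex's SentenceTransformerRerank by its Hugging Face id instead of a local path; a minimal sketch, assuming network access to the Hugging Face Hub:

from llama_index.core.postprocessor import SentenceTransformerRerank

# Download the reranker from the Hugging Face Hub by model id and keep the 3 best candidates
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",  # or "cross-encoder/ms-marco-MiniLM-L-6-v2"
    top_n=3,
)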


5. Implementing reranking in LlamaIndex

Prerequisite: a Chroma vector database already contains a large amount of labor-law data and can be queried directly.

1. Initialize the reranking model

from llama_index.core.postprocessor import SentenceTransformerRerank

# Initialize the reranker
reranker = SentenceTransformerRerank(
    model=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-reranker-v2-m3",
    top_n=3,
)

2. Build a query engine

question="How long is the maximum probation period of a labor contract?"# Create a query enginequery_engine = index.as_query_engine( similarity_top_k=10, # Initial vector recall number text_qa_template=response_template, node_postprocessors=[reranker] # Reranking stage)# Execute queryresponse = query_engine.query(question)# Display resultsprint(f"\nSmart assistant answer:\n{response.response}")print("\nSupport basis:")for idx, node in enumerate(response.source_nodes, 1): meta = node.metadata print(f"\n[{idx}] {meta['full_title']}") print(f" Source file: {meta['source_file']}") print(f" Legal name: {meta['law_name']}") print(f" Clause content: {node.text[:100]}...") print(f"Relevance score: {node.score:.4f}")

3. Complete sample code

import json
import time
from pathlib import Path
from typing import List, Dict

import chromadb
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    Settings,
    get_response_synthesizer,
    PromptTemplate,
)
from llama_index.core.schema import TextNode
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.postprocessor import SentenceTransformerRerank

# ================== Configuration Area ==================
QA_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a professional assistant in the field of Chinese labor law and must strictly abide by the following rules:\n"
    "1. Only use the provided legal provisions to answer questions\n"
    "2. If the question is not related to labor law or is beyond the scope of the knowledge base, clearly state that it cannot be answered\n"
    "3. Indicate the source when quoting provisions\n\n"
    "Available legal provisions (a total of {context_count}):\n{context_str}\n<|im_end|>\n"
    "<|im_start|>user\nQuestion: {query_str}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response_template = PromptTemplate(QA_TEMPLATE)

class Config:
    RERANK_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-reranker-v2-m3"  # Newly added reranking model path
    EMBED_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3"
    LLM_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\Qwen\Qwen2___5-3B-Instruct"
    DATA_DIR = r"D:\Test\LLMTrain\day23\data"
    VECTOR_DB_DIR = r"D:\Test\LLMTrain\day23\chroma_db"
    PERSIST_DIR = r"D:\Test\LLMTrain\day23\storage"
    COLLECTION_NAME = "chinese_labor_laws"
    TOP_K = 10        # Expanded number of candidates for the initial retrieval
    RERANK_TOP_K = 3  # Number of candidates kept after reranking

# Embedding model
embed_model = HuggingFaceEmbedding(
    model_name=Config.EMBED_MODEL_PATH,
)

# LLM
llm = HuggingFaceLLM(
    model_name=Config.LLM_MODEL_PATH,
    tokenizer_name=Config.LLM_MODEL_PATH,
    model_kwargs={"trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},
    generate_kwargs={"temperature": 0.3},
)

# Initialize the reranker (new)
reranker = SentenceTransformerRerank(
    model=Config.RERANK_MODEL_PATH,
    top_n=Config.RERANK_TOP_K,
)

Settings.embed_model = embed_model
Settings.llm = llm

chroma_client = chromadb.PersistentClient(path=Config.VECTOR_DB_DIR)
chroma_collection = chroma_client.get_or_create_collection(
    name=Config.COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"},
)

print("Loading index from chromadb...")
index = VectorStoreIndex.from_vector_store(ChromaVectorStore(chroma_collection=chroma_collection))

question = "What is the maximum probation period of a labor contract?"

# Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=Config.TOP_K,       # Number of candidates recalled by the initial vector retrieval
    text_qa_template=response_template,
    node_postprocessors=[reranker],      # Reranking stage
)

# Execute the query
response = query_engine.query(question)

# Display the results
print(f"\nSmart assistant answer:\n{response.response}")
print("\nSupporting evidence:")
for idx, node in enumerate(response.source_nodes, 1):
    meta = node.metadata
    print(f"\n[{idx}] {meta['full_title']}")
    print(f"  Source file: {meta['source_file']}")
    print(f"  Law name: {meta['law_name']}")
    print(f"  Clause content: {node.text[:100]}...")
    print(f"  Relevance score: {node.score:.4f}")

Note:

In other RAG frameworks (such as LangChain and Haystack) you work with explicit retriever objects, but in LlamaIndex the retriever is embedded inside the QueryEngine, so there is usually no need to create one manually.

For example, the following code:

# (assumes: from llama_index.llms.openai import OpenAI)
query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    llm=OpenAI(model="gpt-3.5-turbo"),
)

This call actually does three things:

(1) It constructs a Retriever internally:

  • similarity_top_k=15 specifies the number of documents to recall initially

  • By default, a VectorIndexRetriever is used (which performs ANN retrieval over the stored vectors)


(2) It sends the recalled results to the re-ranker (node postprocessor)

The SentenceTransformerRerank you passed in re-ranks the candidate documents

(3) Finally, the remaining documents are sent to the LLM to answer the question

Of course, you can also use the retriever explicitly, as in the following code:

question="How long is the maximum probation period of a labor contract?"# 0-Create a retriever retriever = index.as_retriever( similarity_top_k=Config.TOP_K # Expand the number of initial searches)# 0-Create a response synthesizer response_synthesizer = get_response_synthesizer( text_qa_template=response_template, verbose=True)# 1. Initial retrieval initial_nodes = retriever.retrieve(question)# Save the initial score to metadata for node in initial_nodes: node.node.metadata['initial_score'] = node.score # 2. Reranking reranked_nodes = reranker.postprocess_nodes( initial_nodes, query_str=question)# 3. Synthesize the answer response = response_synthesizer.synthesize( question, nodes=reranked_nodes)# Display results (modify display logic)print(f"\nSmart assistant answer:\n{response.response}")print("\nSupport basis:")for idx, node in enumerate(reranked_nodes, 1): # Score acquisition method compatible with the new version of the API initial_score = node.metadata.get('initial_score', node.score) # Get the initial score rerank_score = node.score # Score after reranking meta = node.node.metadata print(f"\n[{idx}] {meta['full_title']}") print(f" Source file: {meta['source_file']}") print(f" Legal name: {meta['law_name']}") print(f" Initial relevance: {node.node.metadata['initial_score']:.4f}") print(f" Reranking score: {node.score:.4f}") print(f" Clause content: {node.node.text[:100]}...")

6. Filter out low-quality documents

After using the re-ranker in LlamaIndex, if you want to further filter out low-quality documents based on their scores (e.g. score < 0.5), you need to write a custom NodePostprocessor, because the built-in SentenceTransformerRerank does not do any filtering itself; it only reranks.

1. Create a filter module

File: score_threshold_filter.py

from typing import List, Optional

from pydantic import Field
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor


class ScoreThresholdFilter(BaseNodePostprocessor):
    """
    Custom node postprocessor: filters nodes based on a score threshold.
    If every node scores below the threshold, the (empty) filtered list is returned.

    Parameters:
        threshold (float): The score threshold used for filtering. Defaults to 0.5.
        verbose (bool): Whether to print detailed filtering information. Defaults to False.
    """

    threshold: float = Field(default=0.5, description="Filter score threshold")
    verbose: bool = Field(default=False, description="Whether to print detailed filtering information")

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        if self.verbose:
            print(f"[ScoreThresholdFilter] Original number of nodes: {len(nodes)}")
            for idx, node in enumerate(nodes):
                score_display = f"{node.score:.4f}" if node.score is not None else "None"
                print(f"Node {idx + 1}: score = {score_display}")

        # Keep nodes whose score is missing or at least the threshold
        filtered_nodes = [
            node for node in nodes
            if node.score is None or node.score >= self.threshold
        ]

        # If no node survives the filtering, return the empty list
        if len(filtered_nodes) < 1:
            if self.verbose:
                print(f"[ScoreThresholdFilter] All node scores are below the threshold {self.threshold}")
            return filtered_nodes

        if self.verbose:
            print(f"[ScoreThresholdFilter] Number of nodes after filtering: {len(filtered_nodes)}, threshold: {self.threshold}")
            for idx, node in enumerate(filtered_nodes):
                score_display = f"{node.score:.4f}" if node.score is not None else "None"
                print(f"Kept node {idx + 1}: score = {score_display}")

        return filtered_nodes

Please note the following:

  • BaseNodePostprocessor is the base class LlamaIndex provides for customizing node postprocessing logic.

  • Pydantic's Field is used to declare the class fields and provide their default values and descriptions.

  • The _postprocess_nodes method is overridden; it is the abstract method that BaseNodePostprocessor requires for processing the node list. (A standalone check of the filter is sketched after this list.)
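
For a quick standalone check, assuming the class above together with the reranked_nodes list and question from the explicit-retriever example earlier, the filter can be called directly through the public postprocess_nodes method:

from score_threshold_filter import ScoreThresholdFilter

score_filter = ScoreThresholdFilter(threshold=0.6, verbose=True)

# Run the filter directly on already-reranked nodes, without going through a query engine
kept_nodes = score_filter.postprocess_nodes(reranked_nodes, query_str=question)
print(f"{len(kept_nodes)} node(s) passed the 0.6 threshold")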


2. Use filters

Adjust the code in the main query path as follows:

from score_threshold_filter import ScoreThresholdFilter

# Custom filter: documents scoring below 0.6 will be removed
score_filter = ScoreThresholdFilter(threshold=0.6, verbose=True)

# Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=Config.TOP_K,                 # Number of candidates recalled by the initial vector retrieval
    text_qa_template=response_template,
    node_postprocessors=[reranker, score_filter],  # Reranking + filtering low scores
)

Effect description:

  • The first postprocessor (the re-ranker) sorts the recalled documents and attaches its scores

  • The second postprocessor (the filter) removes documents whose score is too low

  • Finally, the remaining documents are spliced into the prompt and fed to the LLM

The score used here is the one computed by the re-ranker, not the vector similarity or embedding distance from the initial retrieval.

3. Complete sample code

import json
import time
from pathlib import Path
from typing import List, Dict

import chromadb
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    Settings,
    get_response_synthesizer,
    PromptTemplate,
)
from llama_index.core.schema import TextNode
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.postprocessor import SentenceTransformerRerank
from score_threshold_filter import ScoreThresholdFilter

# ================== Configuration Area ==================
QA_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a professional assistant in the field of Chinese labor law and must strictly follow the following rules:\n"
    "1. Only answer questions using the legal text provided\n"
    "2. If the question is not related to labor law or is beyond the scope of the knowledge base, clearly state that it cannot be answered\n"
    "3. Indicate the source when quoting articles\n\n"
    "Available legal provisions ({context_count} in total):\n{context_str}\n<|im_end|>\n"
    "<|im_start|>user\nQuestion: {query_str}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response_template = PromptTemplate(QA_TEMPLATE)

class Config:
    RERANK_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-reranker-v2-m3"  # Newly added reranking model path
    EMBED_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3"
    LLM_MODEL_PATH = r"D:\Test\LLMTrain\testllm\llm\Qwen\Qwen2___5-3B-Instruct"
    DATA_DIR = r"D:\Test\LLMTrain\day23\data"
    VECTOR_DB_DIR = r"D:\Test\LLMTrain\day23\chroma_db"
    PERSIST_DIR = r"D:\Test\LLMTrain\day23\storage"
    COLLECTION_NAME = "chinese_labor_laws"
    TOP_K = 10        # Expanded number of candidates for the initial retrieval
    RERANK_TOP_K = 3  # Number of candidates kept after reranking

# Embedding model
embed_model = HuggingFaceEmbedding(
    model_name=Config.EMBED_MODEL_PATH,
)

# LLM
llm = HuggingFaceLLM(
    model_name=Config.LLM_MODEL_PATH,
    tokenizer_name=Config.LLM_MODEL_PATH,
    model_kwargs={"trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},
    generate_kwargs={"temperature": 0.3},
)

# Initialize the reranker (new)
reranker = SentenceTransformerRerank(
    model=Config.RERANK_MODEL_PATH,
    top_n=Config.RERANK_TOP_K,
)

Settings.embed_model = embed_model
Settings.llm = llm

chroma_client = chromadb.PersistentClient(path=Config.VECTOR_DB_DIR)
chroma_collection = chroma_client.get_or_create_collection(
    name=Config.COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"},
)

print("Loading index from chromadb...")
index = VectorStoreIndex.from_vector_store(ChromaVectorStore(chroma_collection=chroma_collection))

# question = "What is the maximum probation period of a labor contract?"
question = "What is xtuner?"

# Custom filter: documents scoring below 0.6 will be removed
score_filter = ScoreThresholdFilter(threshold=0.6, verbose=True)

# Create a query engine
query_engine = index.as_query_engine(
    similarity_top_k=Config.TOP_K,                 # Number of candidates recalled by the initial vector retrieval
    text_qa_template=response_template,
    node_postprocessors=[reranker, score_filter],  # Reranking + filtering low scores
)

# Execute the query
response = query_engine.query(question)

if not response.response or response.response.strip() == "" or response.response.strip().lower() == "empty response":
    print("No relevant data found")
else:
    # Display the results
    print(f"\nSmart assistant answer:\n{response.response}")
    print("\nSupporting evidence:")
    for idx, node in enumerate(response.source_nodes, 1):
        meta = node.metadata
        print(f"\n[{idx}] {meta['full_title']}")
        print(f"  Source file: {meta['source_file']}")
        print(f"  Law name: {meta['law_name']}")
        print(f"  Clause content: {node.text[:100]}...")
        print(f"  Relevance score: {node.score:.4f}")


7. RAG Process in LlamaIndex

The process is as follows:

1️⃣ User inputs a query
   ↓
2️⃣ Retriever recalls related documents (vector / BM25 / multimodal)
   ↓
3️⃣ NodePostprocessor processes the recalled results
   - Semantic reranking
   - Deduplication, filtering
   - Splitting, merging
   ↓
4️⃣ ResponseSynthesizer splices the context and constructs the prompt
   - response_mode: compact / refine / tree_summarize
   ↓
5️⃣ LLM generates the answer (based on the prompt + documents)
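
To make the mapping concrete, here is a minimal sketch, assuming the index and reranker objects built in the earlier examples, that assembles the same pipeline from explicit LlamaIndex components:

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# Stage 2: the retriever recalls the initial candidates
retriever = index.as_retriever(similarity_top_k=10)

# Stage 4: the response synthesizer splices the context and builds the prompt
response_synthesizer = get_response_synthesizer(response_mode="compact")

# Wire the stages together; the reranker runs as a node postprocessor (stage 3)
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_synthesizer=response_synthesizer,
)

# Stage 1 (user query in) and stage 5 (LLM-generated answer out)
response = query_engine.query("How long is the maximum probation period of a labor contract?")
print(response.response)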