BM25: Text Relevance Ranking in RAG

An in-depth exploration of the key role and implementation details of the BM25 algorithm in RAG systems.
Core content:
1. Application of the BM25 algorithm in the RAG retrieval stage
2. BM25's method and formula for determining relevance
3. Comparison and advantage analysis between BM25 and TF-IDF
BM25 is a text relevance ranking algorithm.
During the retrieval stage of the RAG system, it can quickly find candidate paragraphs that best match the query from massive documents, providing a more focused context for subsequent generation models.
BM25 can be used independently or in conjunction with dense retrieval to build a more powerful hybrid retriever.
Contents:
1. What does BM25 do?
2. How does BM25 determine relevance?
   2.1 Term Frequency (TF)
   2.2 Inverse Document Frequency (IDF)
   2.3 Document Length
3. Implementation principle and formula of BM25
4. Example: Calculating BM25 score
   4.1 Document Information
   4.2 The query: "model algorithm performance"
   4.3 Calculate IDF (using BM25's IDF formula)
   4.4 Calculate the TF-IDF score for each document
   4.5 Calculate the BM25 score for each document
   4.6 Comparative analysis with TF-IDF
5. Advantages of BM25
6. Using LangChain to implement BM25 retrieval
Summary
When a user asks a question, RAG (Retrieval-Augmented Generation) first retrieves relevant document fragments from massive data, then provides these retrieval results to the LLM to generate a more accurate and reliable answer.
Information retrieval is the cornerstone of RAG, and efficient, accurate retrieval relies on a variety of ranking and matching algorithms.
In the field of information retrieval, there are both modern methods based on vector similarity and traditional methods based on keywords.
TF-IDF (Term Frequency-Inverse Document Frequency) is one of the classic keyword matching algorithms, while BM25 (Best Matching 25) is a more advanced algorithm developed based on TF-IDF.
BM25 has been proven and widely used in the field of information retrieval, and plays an important role in the retrieval stage of RAG.
Related reading: TF-IDF: Understanding the value of “keywords”
1. What does BM25 do?
Okapi BM25 (BM is short for "Best Match") is a ranking function used to estimate the relevance of a document to a given search query.
The "Okapi" in the name comes from the Okapi Information Retrieval System, the first system to use the function, implemented at City University, London in the 1980s and 1990s.
Simply put, BM25 is an algorithm that calculates a relevance score between a user query and a document: the higher the score, the more relevant the document is to the query.
In the retrieval stage of RAG, when a user enters a query, BM25 computes the relevance score between the query and each document in the knowledge base, then selects the highest-scoring documents and passes them to the generation model.
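As a quick illustration, here is a minimal sketch using the open-source rank_bm25 package (the library that also backs LangChain's BM25Retriever, used in section 6); the toy corpus and whitespace tokenization are assumptions for the demo:

# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "machine learning model training",
    "deep learning neural network",
    "algorithm performance optimization",
]
tokenized_corpus = [doc.split() for doc in corpus]   # naive whitespace tokenization

bm25 = BM25Okapi(tokenized_corpus)
query = "model performance".split()

print(bm25.get_scores(query))               # one relevance score per document
print(bm25.get_top_n(query, corpus, n=2))   # the 2 highest-scoring documents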
2. How does BM25 determine relevance?
The core idea of BM25 is very similar to how you normally search for information:
2.1 Term Frequency (TF)
If a word appears more times in a document, then the document is likely to be more relevant to that word.
For example, if you search for "iPhone" and the words "Apple" and "phone" appear many times in a document, the document is likely about the iPhone.
However, BM25 does not reward high term frequency indefinitely; it has a notion of "saturation". Imagine eating apples: the first one makes you happy, and so does the second, but by the tenth you are far less enthusiastic.
Likewise, BM25 assumes the contribution of term frequency gradually flattens: beyond a certain count, additional occurrences do little to improve relevance.
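To make saturation concrete, here is the saturation term from the BM25 formula (introduced in section 3), evaluated for growing term frequencies with an assumed k1 = 1.5 and length normalization ignored:

k1 = 1.5
for tf in [1, 2, 5, 10, 100]:
    contribution = tf * (k1 + 1) / (tf + k1)
    print(f"tf={tf:3d} -> contribution={contribution:.2f}")
# tf=1 gives 1.00 and tf=100 only 2.46: each extra occurrence adds less,
# and the value can never exceed k1 + 1 = 2.5.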
2.2 Inverse Document Frequency (IDF)
The fewer documents a word appears in across the entire document collection, the better it distinguishes a document's topic, and therefore the more important it is.
For example, Chinese function words such as "的", "是", and "在" (roughly "of", "is", "at") appear in almost all documents. They are useless for determining a document's topic, so their IDF values are very low.
But words like "BM25", "RAG", and "Transformer" appear only in documents from specific fields; their IDF values are very high, so their presence is strong evidence of what a document is about.
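A quick calculation with BM25's smoothed IDF formula (given in section 3) shows the contrast; the collection size and document counts below are made up for illustration:

import math

def bm25_idf(N, n):
    # N: total documents in the collection; n: documents containing the word
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

N = 1000
print(bm25_idf(N, 950))   # a near-ubiquitous word like "的": ~0.05
print(bm25_idf(N, 5))     # a rare term like "BM25": ~5.20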
2.3 Document Length
BM25 also takes document length into account. If a 100-word news brief mentions "artificial intelligence" twice and a 100,000-word encyclopedia also mentions it twice, the former is clearly more likely to be focused on "artificial intelligence".
BM25 therefore treats a term occurrence in a short document as more significant than one in a long document. It uses a mechanism to "penalize" overly long documents, preventing them from scoring highly simply because they contain many words.
BM25 combines these three main factors into a single formula for the final relevance score. The formula also contains parameters (such as $k_1$ and $b$) that let you tune the influence of term-frequency saturation and document length.
3. Implementation principle and formula of BM25
Now let’s look at how BM25 implements these ideas through formulas.
BM25 computes a relevance score between a query $Q$ and a document $D$. The basic idea is to add up the relevance contribution of each word $q_i$ in the query:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

For each word $q_i$ in the query, its contribution to document $D$'s score is determined by the following components.

Term frequency of $q_i$ in document $D$: $f(q_i, D)$, the number of times $q_i$ occurs in $D$.

Inverse document frequency of $q_i$: $\text{IDF}(q_i)$. This measures how rare the word $q_i$ is in the entire document collection. It is usually calculated as:

$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

where $N$ is the total number of documents in the collection and $n(q_i)$ is the number of documents containing $q_i$. The logarithm and the added $0.5$ terms smooth the value, avoiding division by zero and negative scores. The higher the IDF value, the rarer and more important the word.

Term frequency saturation: this part handles the saturation of term frequency. Setting length normalization aside for a moment, it is:

$$\frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1}$$

where $k_1$ is a positive parameter that controls how strongly term frequency affects the score. Values between 1.2 and 2.0 usually give good results.

Document length normalization (penalizing long documents): this part handles the impact of document length via the factor

$$1 - b + b \cdot \frac{|D|}{\text{avgdl}}$$

in the denominator, where $|D| / \text{avgdl}$ is the ratio of document $D$'s length to the average document length of the collection. The parameter $b$ controls the degree of length normalization and usually takes a value between 0 and 1: when $b = 0$, no length normalization is performed; when $b = 1$, full length normalization is applied.
Breaking it down, BM25 has three parts:
- IDF: screens out uninformative high-frequency words
- TF saturation: limits the bonus from keyword stuffing
- Document length normalization: favors short, focused documents and penalizes long ones
The entire formula mimics our human intuition when searching: find the keywords, judge which ones matter, and distrust padding; it simply translates that intuition into mathematical language.
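Putting the pieces together, here is a from-scratch sketch of the scoring function above, using the common defaults k1 = 1.5 and b = 0.75 (an illustration of the formula, not a production implementation):

import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus is a list of tokenized documents; doc_terms is one of them
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # documents containing the term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
        tf = doc_terms.count(term)                     # term frequency (0 contributes 0)
        norm = 1 - b + b * len(doc_terms) / avgdl      # length normalization factor
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

corpus = [["machine", "learning", "model"], ["algorithm", "performance"]]
print(bm25_score(["model", "performance"], corpus[0], corpus))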
4. Example: Calculating BM25 score
To better illustrate the advantages of BM25 over TF-IDF in dealing with document length and term frequency saturation, we use a document collection containing 3 documents for demonstration.
For simplicity, we assume the texts have already been tokenized; slashes mark token boundaries.
Document A: machine learning / model / training / algorithm / performance (length 5)
Document B: deep learning / model / neural network / training / big data / computing power / optimization / performance (length 8)
Document C: algorithm / efficiency / optimization / performance (length 4)
4.1 Document Information
Total number of documents: $N = 3$
Document A length: $|D_A| = 5$
Document B length: $|D_B| = 8$
Document C length: $|D_C| = 4$
Average document length: $\text{avgdl} = (5 + 8 + 4) / 3 \approx 5.67$
4.2 The query: "model algorithm performance"
The query terms are "model", "algorithm", and "performance".
4.3 Calculate IDF (using BM25's IDF formula):
The word "model" appears in documents A and B, so $n = 2$:
$$\text{IDF}(\text{model}) = \ln\left(\frac{3 - 2 + 0.5}{2 + 0.5} + 1\right) = \ln(1.6) \approx 0.47$$
The word "algorithm" appears in documents A and C, so likewise $\text{IDF}(\text{algorithm}) \approx 0.47$.
The word "performance" appears in documents A, B, and C, so $n = 3$:
$$\text{IDF}(\text{performance}) = \ln\left(\frac{3 - 3 + 0.5}{3 + 0.5} + 1\right) = \ln(8/7) \approx 0.13$$
4.4 Calculate the TF-IDF score for each document
We use a simple TF-IDF formula, with raw term counts as TF and, for comparability, the same smoothed IDF values computed above:
$$\text{score}(D, Q) = \sum_{q_i \in Q} \text{TF}(q_i, D) \cdot \text{IDF}(q_i)$$
Term frequency (TF): each query word occurs at most once in each document, so TF is 1 for the words a document contains and 0 otherwise.
TF-IDF score calculation:
Document A (contains all three query words): $0.47 + 0.47 + 0.13 \approx 1.07$
Document B (contains "model" and "performance"): $0.47 + 0.13 \approx 0.60$
Document C (contains "algorithm" and "performance"): $0.47 + 0.13 \approx 0.60$
TF-IDF ranking results:
Based on the TF-IDF scores, the documents are ranked as follows:
Document A (1.07) > Document B (0.60) = Document C (0.60)
In this simple TF-IDF calculation, documents B and C get the same score because each contains two of the query terms with identical IDF values; TF-IDF in this form does not account for document length at all.
4.5 Calculate the BM25 score for each document
Assume $k_1 = 1.5$ and $b = 0.75$.
First, calculate the length normalization factor $1 - b + b \cdot |D| / \text{avgdl}$ for each document:
Document A (length 5): $0.25 + 0.75 \times (5 / 5.67) \approx 0.91$
Document B (length 8): $0.25 + 0.75 \times (8 / 5.67) \approx 1.31$
Document C (length 4): $0.25 + 0.75 \times (4 / 5.67) \approx 0.78$
Then use the BM25 formula to compute each word's contribution to each document and sum them. Since every matching word here has TF = 1, each contribution is $\text{IDF} \times \frac{1 \times (1.5 + 1)}{1 + 1.5 \times \text{norm}}$, where norm is the document's length normalization factor.

Document A (norm ≈ 0.91, TF factor $2.5 / (1 + 1.5 \times 0.91) \approx 1.06$):
The word "model" (TF=1, IDF≈0.47): $0.47 \times 1.06 \approx 0.50$
The word "algorithm" (TF=1, IDF≈0.47): $0.47 \times 1.06 \approx 0.50$
The word "performance" (TF=1, IDF≈0.13): $0.13 \times 1.06 \approx 0.14$
Total for document A ≈ 1.13

Document B (norm ≈ 1.31, TF factor $2.5 / (1 + 1.5 \times 1.31) \approx 0.84$):
The word "model" (TF=1, IDF≈0.47): $0.47 \times 0.84 \approx 0.40$
The word "algorithm" (TF=0): 0
The word "performance" (TF=1, IDF≈0.13): $0.13 \times 0.84 \approx 0.11$
Total for document B ≈ 0.51

Document C (norm ≈ 0.78, TF factor $2.5 / (1 + 1.5 \times 0.78) \approx 1.15$):
The word "model" (TF=0): 0
The word "algorithm" (TF=1, IDF≈0.47): $0.47 \times 1.15 \approx 0.54$
The word "performance" (TF=1, IDF≈0.13): $0.13 \times 1.15 \approx 0.15$
Total for document C ≈ 0.70
BM25 ranking results:
According to their BM25 scores, the documents are ranked as follows:
Document A (1.13) > Document C (0.70) > Document B (0.51)
4.6 Comparative analysis with TF-IDF:
With this new example, we can more clearly see the difference in ranking between BM25 and TF-IDF:
TF-IDF ranking: Document A (1.07) > Document B (0.60) = Document C (0.60)
BM25 ranking: Document A (1.13) > Document C (0.70) > Document B (0.51)
Although documents B and C each contain two of the query terms, and those terms have the same IDF values, BM25 gives document C a higher score and ranks it ahead of document B. The reason is:
Document length normalization: document C (length 4) is shorter than the average document length (≈ 5.67), so its length normalization factor (≈ 0.78) is less than 1, which boosts the contribution of its words. Document B (length 8) is longer than average, so its factor (≈ 1.31) is greater than 1, which dampens the contribution of its words.
This example shows how BM25 adjusts relevance scores through document length normalization, making its ranking match the intuition that a short document containing the query terms is more likely to be focused on the topic. Rankings that better reflect topical focus are crucial for a RAG system to recall high-quality relevant documents.
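As a cross-check, the short script below recomputes the section 4 scores with the same from-scratch formula (a sketch, not a library implementation); the underscore-joined tokens simply mirror the example's segmentation so the document lengths stay 5, 8, and 4.

import math

# Sanity check of the hand calculation above (k1 = 1.5, b = 0.75)
docs = {
    "A": ["machine_learning", "model", "training", "algorithm", "performance"],
    "B": ["deep_learning", "model", "neural_network", "training", "big_data",
          "computing_power", "optimization", "performance"],
    "C": ["algorithm", "efficiency", "optimization", "performance"],
}
query = ["model", "algorithm", "performance"]

N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N
k1, b = 1.5, 0.75

for name, doc in docs.items():
    score = 0.0
    for term in query:
        n = sum(1 for d in docs.values() if term in d)   # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)    # smoothed IDF
        tf = doc.count(term)
        norm = 1 - b + b * len(doc) / avgdl              # length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    print(name, round(score, 3))  # A 1.134, B 0.509, C 0.696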
5. Advantages of BM25
Both TF-IDF and BM25 are text retrieval algorithms built on the ideas of term frequency (TF) and inverse document frequency (IDF). In RAG scenarios, however, BM25 is more commonly used because it addresses several major limitations of TF-IDF in practical information retrieval and produces better rankings.
BM25 can be seen as an optimized version of TF-IDF that improves on it in three respects: TF saturation, document length normalization, and IDF smoothing.
In addition to traditional keyword-based retrieval methods (TF-IDF and BM25), many RAG systems now use vector-based dense retrieval. However, in some practical scenarios, BM25 is still very competitive, especially in the following tasks:
- Tasks with short texts and clear terminology (such as product Q&A): the keyword signal is already strong and needs no complex semantic modeling.
- Small datasets or insufficiently trained embeddings: BM25 often provides a more stable baseline retrieval quality.
- As part of hybrid retrieval: many high-performance RAG systems combine "sparse + dense" retrieval, using BM25 for precise keyword recall, vector recall to broaden semantic coverage, and a final re-ranking step to improve the overall result (see the sketch at the end of this section).
So BM25 remains very useful for many tasks, and it can also serve as the first stage of a more complex multi-stage retrieval system.
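As a sketch of that "sparse + dense" combination, the snippet below fuses BM25 with a FAISS vector retriever using LangChain's EnsembleRetriever. The embedding model (OpenAIEmbeddings, which needs an API key), the example documents, and the 50/50 weights are assumptions for illustration:

# pip install faiss-cpu langchain-openai
from langchain.retrievers import EnsembleRetriever
from langchain.schema import Document
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

documents = [Document(page_content=t) for t in [
    "BM25 is a keyword-based ranking algorithm",
    "Dense retrieval uses embedding vectors to capture semantics",
]]

# Sparse retriever: exact keyword matching via BM25
bm25_retriever = BM25Retriever.from_documents(documents, k=2)

# Dense retriever: semantic matching via a FAISS vector store
dense_retriever = FAISS.from_documents(documents, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 2}
)

# Weighted fusion of the two ranked lists (the weights are a tuning choice)
hybrid = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever], weights=[0.5, 0.5]
)
results = hybrid.invoke("keyword ranking")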
6. Using LangChain to implement BM25 retrieval
LangChain provides a built-in BM25 implementation. Here are the basic steps to implement BM25 retrieval using LangChain:
# Install necessary libraries
# pip install jieba rank_bm25
from langchain.schema import Document
from langchain_community.retrievers import BM25Retriever
import jieba
def chinese_tokenizer(text):
    # Segment Chinese text into a list of tokens with jieba
    return list(jieba.cut(text))
# Sample document content
doc_contents = [
    "Artificial intelligence technology is developing rapidly. Machine learning is an important branch of artificial intelligence.",
    "Natural language processing is another important branch of artificial intelligence. Text analysis is a common application of natural language processing.",
    "Technological development has brought many new application areas.",
    "This is a bonus document on natural language processing that covers more text analysis."
]
# Convert the document content to a list of LangChain Document objects
documents = [Document(page_content=content) for content in doc_contents]
# Create a BM25 retriever
# By default, k=4, which means the 4 most relevant documents are returned
# BM25Retriever's default preprocessing splits on whitespace, which suits English.
# Pass a Chinese tokenizer to avoid ranking errors caused by bad segmentation.
retriever = BM25Retriever.from_documents(documents, k=2, preprocess_func=chinese_tokenizer)
# Define the query
query = "Application of artificial intelligence technology"
# Perform a search
relevant_docs = retriever.invoke(query)
# Print search results
print( f"For query ' {query} ', relevant documents retrieved:" )
for i, doc in enumerate(relevant_docs):
print( f"\n--- document {i + 1 } ---" )
print(doc.page_content)
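Here retriever.invoke(query) tokenizes the query with the same chinese_tokenizer, scores every document against it with BM25, and returns the k=2 highest-scoring Document objects.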
Summary
In general, BM25 is a text relevance ranking algorithm that combines simplicity, interpretability, and efficiency.
During the retrieval stage of a RAG system, it can quickly find the candidate passages that best match a query from massive document collections, without any training, providing a more focused context for the downstream generation model.
BM25 can be used on its own or combined with dense retrieval to build a more powerful hybrid retriever.