BM25: Text Relevance Ranking in RAG

Written by
Silas Grey
Updated on: June 20, 2025

In-depth exploration of the key role and implementation details of the BM25 algorithm in the RAG system.

Core content:
1. Application of the BM25 algorithm in the RAG retrieval stage
2. BM25's method and formula for determining relevance
3. Comparison and advantage analysis between BM25 and TF-IDF


BM25 is a text relevance ranking algorithm.

During the retrieval stage of the RAG system, it can quickly find candidate paragraphs that best match the query from massive documents, providing a more focused context for subsequent generation models.

BM25 can be used independently or in conjunction with dense search to build a more powerful hybrid search engine.

  • 1. What does BM25 do?
  • 2. How does BM25 determine relevance?
    • 2.1 Term Frequency (TF)
    • 2.2 Inverse Document Frequency (IDF)
    • 2.3 Document Length
  • 3. Implementation principle and formula of BM25
  • 4. Example: Calculating BM25 score
    • 4.1 Document Information
    • 4.2 Example Query: "Model Algorithm Performance"
    • 4.3 Calculate IDF (using BM25's IDF formula)
    • 4.4 Calculate the TF-IDF score for each document
    • 4.5 Calculate the BM25 score for each document
    • 4.6 Comparative analysis with TF-IDF
  • 5. Advantages of BM25
  • 6. Using LangChain to implement BM25 retrieval
  • Summary


When a user asks a question, RAG (Retrieval-Augmented Generation) first "retrieves" relevant document fragments from massive data, then provides these retrieved fragments to the LLM to generate a more accurate and reliable answer.

Information retrieval is the cornerstone of RAG, and to achieve efficient and accurate retrieval, it is necessary to rely on various sorting and matching algorithms.

In the field of information retrieval, there are both modern methods based on vector similarity and traditional methods based on keywords.

TF-IDF (Term Frequency-Inverse Document Frequency) is one of the classic keyword matching algorithms, while BM25 (Best Matching 25) is a more advanced algorithm developed based on TF-IDF.

BM25 has been proven and widely used in the field of information retrieval, and plays an important role in the retrieval stage of RAG.

Related reading: TF-IDF: Understanding the value of “keywords”

1. What does BM25 do?

Okapi BM25 (BM is short for "Best Match") is a ranking function used to estimate the relevance of a document to a given search query.

The "Okapi" part of the name comes from the first system to use it: the Okapi Information Retrieval System, implemented at City University, London in the 1980s and 1990s.

Simply put, BM25 is an algorithm for calculating a relevance score between a "user query" and a "document". The higher the score, the more relevant the document is to the query.

In the retrieval stage of RAG, when a user enters a query, BM25 calculates the relevance score between the query and each document in the knowledge base, then selects the highest-scoring documents and passes them to the generation model.

2. How does BM25 determine relevance?

The core idea of BM25 is very similar to how you normally search for information:

2.1 Term Frequency (TF)

If a word appears more times in a document, then the document is likely to be more relevant to that word.

For example, if you search for "iPhone", if the words "Apple" and "phone" appear many times in a document, then it is likely about iPhone.

However, BM25 does not reward high term frequency indefinitely; it has a notion of "saturation". Imagine eating apples: the first one makes you happy, the second one too, but by the tenth you are far less enthusiastic.

BM25 assumes that the contribution of term frequency gradually flattens out: beyond a certain count, additional occurrences do little to increase relevance.
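
To make the saturation effect concrete, here is a tiny sketch (our own illustration, assuming the standard BM25 TF factor with $k_1 = 1.5$ and ignoring document length normalization for simplicity):

k1 = 1.5
for tf in [1, 2, 5, 10, 100]:
    # BM25's term-frequency factor: tf * (k1 + 1) / (tf + k1)
    print(tf, round(tf * (k1 + 1) / (tf + k1), 3))
# Output: 1 1.0, 2 1.429, 5 1.923, 10 2.174, 100 2.463
# The factor approaches the ceiling k1 + 1 = 2.5 but never exceeds it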

2.2 Inverse Document Frequency (IDF)

The fewer documents a word appears in across the entire document collection, the better it distinguishes a document's topic, and therefore the more important it is.

For example, stopwords such as "the", "is", and "at" appear in almost all documents. They do not help determine a document's topic, so their IDF values are very low.

Words like "BM25", "RAG", and "Transformer", by contrast, appear only in documents from specific fields; their IDF values are very high, and their presence is strong evidence of what a document is about.
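
A quick numeric illustration (using the smoothed BM25 IDF formula introduced in Section 3; the collection size and document counts are made up for this example):

import math

N = 1000  # assumed total number of documents
for n in (950, 5):  # a stopword-like term vs. a rare, topical term
    idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
    print(n, round(idf, 3))
# Output: 950 0.052 (near zero), 5 5.204 (large)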

2.3 Document Length

BM25 also takes document length into account. If a 100-word news article mentions "artificial intelligence" twice and a 100,000-word encyclopedia also mentions it twice, the former is clearly more likely to be focused on "artificial intelligence".

BM25 therefore treats a word occurrence in a short document as more significant than the same occurrence in a long one. It uses a mechanism that "penalizes" overly long documents to prevent them from scoring highly simply because they contain many words.

BM25 combines these three main factors into a single scoring formula. The formula also has tunable parameters ($k_1$ and $b$) that let you adjust the strength of term frequency saturation and document length normalization.

3. Implementation principle and formula of BM25

Now let’s look at how BM25 implements these ideas through formulas.

BM25 calculates the relevance score between a query $Q$ and a document $D$ by adding up the relevance contribution of each term in the query:

$$\mathrm{score}(Q, D) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

For each term $q_i$ in the query $Q$, its contribution to the score of document $D$ is determined by the following components:

  1. Term frequency: $f(q_i, D)$ is the number of times the term $q_i$ appears in document $D$.

  2. Inverse document frequency: $\mathrm{IDF}(q_i)$ measures how rare the term $q_i$ is across the entire document collection. It is usually calculated as:

$$\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

  • $N$ is the total number of documents in the collection;

  • $n(q_i)$ is the number of documents that contain the term $q_i$;

  • the $0.5$ offsets and the $+1$ inside the logarithm smooth the estimate, avoiding division by zero and negative IDF values.

The higher the IDF value, the rarer and more important the word is.

  3. Term frequency saturation: the factor $\frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (\ldots)}$ deals with the saturation of term frequency.

  • $k_1$ is a positive parameter that controls how strongly term frequency (TF) affects the document score; values between $1.2$ and $2.0$ usually give good results.

  4. Document length normalization (penalizing long documents): the expression $1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}$ in the denominator handles the impact of document length.

  • $|D|$ is the length of document $D$ and $\mathrm{avgdl}$ is the average document length of the entire collection, so $\frac{|D|}{\mathrm{avgdl}}$ is their ratio;

  • the parameter $b$ controls the degree of document length normalization; it usually takes a value between $0$ and $1$ (commonly $0.75$). When $b = 0$, no length normalization is performed; when $b = 1$, full length normalization is performed.
After taking the formula apart, you will find that BM25 has three building blocks:

  • IDF: Screen out useless high-frequency words
  • TF saturation: Limit the bonus of keyword stacking
  • Document length normalization: short articles are prioritized, long articles are downgraded

The entire formula imitates our human intuition of "finding keywords", "judging what is important", and "disliking too many useless words", translated into mathematical language.
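
To make the formula concrete, here is a minimal self-contained Python sketch of the scoring function above (the function name and structure are our own illustration, not a library API):

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus is a list of tokenized documents (each a list of terms)
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    norm = 1 - b + b * len(doc_terms) / avgdl  # document length normalization
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)  # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed IDF
        score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * norm)  # saturating TF
    return score

Terms absent from the document contribute zero, since tf[q] is 0 for them.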

4. Example: Calculating BM25 score

To better illustrate the advantages of BM25 over TF-IDF in dealing with document length and term frequency saturation, we use a document collection containing 3 documents for demonstration.

For simplicity, we assume that the texts have been tokenized.

  • Document A:  Machine Learning Model Training Algorithm Performance (Length 5)

  • Document B:  Deep Learning Model Neural Network Training Big Data Computing Power Optimization Performance (Length 8)

  • Document C:  Algorithm Efficiency Optimization Performance (Length 4)

4.1 Document Information

  • Total number of documents: $N = 3$

  • Document A length: $|D_A| = 5$

  • Document B length: $|D_B| = 8$

  • Document C length: $|D_C| = 4$

  • Average document length: $\mathrm{avgdl} = (5 + 8 + 4) / 3 \approx 5.67$

4.2 Example Query: "Model Algorithm Performance"

The query terms are "model", "algorithm", and "performance".

4.3 Calculate IDF (using BM25's IDF formula):

  • The word "model" appears in documents A and B ($n = 2$): $\mathrm{IDF} = \ln\left(\frac{3 - 2 + 0.5}{2 + 0.5} + 1\right) = \ln(1.6) \approx 0.47$

  • The word "algorithm" appears in documents A and C ($n = 2$), so likewise $\mathrm{IDF} \approx 0.47$

  • The word "performance" appears in documents A, B, and C ($n = 3$); it distinguishes nothing, so its IDF is very small; this example uses $\mathrm{IDF} \approx 0.058$

4.4 Calculate the TF-IDF score for each document

We use the standard TF-IDF formula: $\mathrm{TFIDF}(q, D) = \mathrm{TF}(q, D) \times \mathrm{IDF}(q)$, summed over the query terms.

Term Frequency (TF): each query term that occurs in a document occurs exactly once, so $\mathrm{TF} = 1$ for every matched term.

TF-IDF score calculation:

  • Document A (matches all three terms): $0.47 + 0.47 + 0.058 \approx 0.998$

  • Document B (matches "model" and "performance"): $0.47 + 0.058 = 0.528$

  • Document C (matches "algorithm" and "performance"): $0.47 + 0.058 = 0.528$

TF-IDF ranking results:

Based on the TF-IDF scores, the documents are ranked as follows:

Document A (0.998) > Document B (0.528) = Document C (0.528)

In this simple TF-IDF calculation, documents B and C receive the same score because each contains two of the query terms and those terms have identical IDF values. TF-IDF in this form does not take the effect of document length on relevance into account.

4.5 Calculate the BM25 score for each document

Assume $k_1 = 1.5$ and $b = 0.75$.

First, calculate the soft length normalization term $1 - b + b \cdot |D| / \mathrm{avgdl}$ for each document:

  • Document A (length 5): $0.25 + 0.75 \times 5 / 5.67 \approx 0.91$

  • Document B (length 8): $0.25 + 0.75 \times 8 / 5.67 \approx 1.31$

  • Document C (length 4): $0.25 + 0.75 \times 4 / 5.67 \approx 0.78$

Then use the BM25 formula to calculate each term's contribution, $\mathrm{IDF} \times \frac{\mathrm{TF} \times (k_1 + 1)}{\mathrm{TF} + k_1 \times \mathrm{norm}}$, and sum them up:

Document A ($D_A$, soft length normalization term ≈ 0.91):

  • The word "model" (TF=1, IDF≈0.47): $0.47 \times \frac{1 \times 2.5}{1 + 1.5 \times 0.91} \approx 0.497$

  • The word "algorithm" (TF=1, IDF≈0.47): likewise $\approx 0.497$

  • The word "performance" (TF=1, IDF≈0.058): $0.058 \times \frac{2.5}{1 + 1.5 \times 0.91} \approx 0.061$

  • Total: $0.497 + 0.497 + 0.061 \approx 1.055$

Document B ($D_B$, soft length normalization term ≈ 1.31):

  • The word "model" (TF=1, IDF≈0.47): $0.47 \times \frac{2.5}{1 + 1.5 \times 1.31} \approx 0.396$

  • The word "algorithm" (TF=0): 0

  • The word "performance" (TF=1, IDF≈0.058): $\approx 0.049$

  • Total: $0.396 + 0.049 \approx 0.445$

Document C ($D_C$, soft length normalization term ≈ 0.78):

  • The word "model" (TF=0): 0

  • The word "algorithm" (TF=1, IDF≈0.47): $0.47 \times \frac{2.5}{1 + 1.5 \times 0.78} \approx 0.542$

  • The word "performance" (TF=1, IDF≈0.058): $\approx 0.067$

  • Total: $0.542 + 0.067 \approx 0.609$

BM25 ranking results:

According to BM25 score, the documents are ranked as follows:

Document A (1.055) > Document C (0.609) > Document B (0.445)
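
For readers who want to check these numbers, here is a short standalone script (our own sketch, reusing the IDF values from Section 4.3) that reproduces the hand calculation up to rounding:

N_docs, avgdl = 3, 17 / 3
k1, b = 1.5, 0.75
idf = {"model": 0.47, "algorithm": 0.47, "performance": 0.058}  # from Section 4.3
docs = {
    "A": (5, ["model", "algorithm", "performance"]),
    "B": (8, ["model", "performance"]),
    "C": (4, ["algorithm", "performance"]),
}

for name, (length, terms) in docs.items():
    norm = 1 - b + b * length / avgdl  # soft length normalization
    score = sum(idf[t] * (k1 + 1) / (1 + k1 * norm) for t in terms)  # all TFs are 1
    print(name, round(norm, 2), round(score, 3))
# Output: A 0.91 1.054, B 1.31 0.445, C 0.78 0.609
# (small differences from the hand calculation are rounding artifacts)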

4.6 Comparative analysis with TF-IDF:

With this new example, we can more clearly see the difference in ranking between BM25 and TF-IDF:

  • TF-IDF ranking:  Document A (0.998) > Document B (0.528) = Document C (0.528)

  • BM25 ranking:  Document A (1.055) > Document C (0.609) > Document B (0.445)

Although both document B and document C contain two of the query terms, and those terms have identical IDF values, BM25 gives document C a higher score and ranks it ahead of document B. This is because of:

Document length normalization: document C (length 4) is shorter than the average document length (≈ 5.67), so its soft length normalization term (≈ 0.78) is less than 1, which boosts the contribution of its terms. Document B (length 8) is longer than average, so its normalization term (≈ 1.31) is greater than 1, which dampens the contribution of its terms.

This example shows how BM25 adjusts relevance scores through document length normalization, making its rankings match the intuition that a short document containing the query terms is more likely to be focused on the topic. This produces more accurate rankings in information retrieval, which is crucial for a RAG system to recall high-quality relevant documents.

5. Advantages of BM25

Both TF-IDF and BM25 are text retrieval algorithms built on term frequency (TF) and inverse document frequency (IDF). However, in RAG scenarios, BM25 is more commonly used because it solves several major limitations of TF-IDF in practical information retrieval and produces better rankings.

BM25 can be seen as an optimized version of TF-IDF, which performs better in TF saturation, document length normalization, and IDF smoothing.

In addition to traditional keyword-based retrieval methods (TF-IDF and BM25), many RAG systems now use vector-based dense retrieval. However, in some practical scenarios, BM25 is still very competitive, especially in the following tasks:

  • Tasks with short texts and clear terminology (such as product Q&A): the keyword signal itself is already strong and does not require complex semantic modeling;
  • When the dataset is small or the embeddings are poorly trained, BM25 can often provide a more stable baseline retrieval quality;
  • As part of hybrid retrieval: many high-performance RAG systems actually combine "sparse + dense" retrieval, using BM25 for precise keyword recall, vector recall to supplement semantic coverage, and a final re-ranking stage to improve the overall result.

So BM25 is very useful in some tasks, and it can also be used as the first step in building complex multi-stage retrieval systems.
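
As an illustration of the hybrid pattern, here is a hedged sketch using LangChain's EnsembleRetriever to fuse BM25 with a dense FAISS retriever; the embedding model name and the 0.5/0.5 weights are illustrative assumptions, not recommendations:

# pip install langchain langchain-community faiss-cpu sentence-transformers

from langchain.retrievers import EnsembleRetriever
from langchain.schema import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

docs = [Document(page_content=t) for t in [
    "BM25 is a sparse, keyword-based ranking function.",
    "Dense retrieval embeds queries and documents as vectors.",
]]

# Sparse retriever: exact keyword matching via BM25
sparse = BM25Retriever.from_documents(docs, k=2)

# Dense retriever: semantic matching via embeddings (model name is illustrative)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
dense = FAISS.from_documents(docs, embeddings).as_retriever(search_kwargs={"k": 2})

# Fuse the two rankings; weights control each retriever's influence
hybrid = EnsembleRetriever(retrievers=[sparse, dense], weights=[0.5, 0.5])
results = hybrid.invoke("What is keyword-based ranking?")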

6. Using LangChain to implement BM25 retrieval

LangChain provides a built-in BM25 implementation. Here are the basic steps to implement BM25 retrieval using LangChain:

# Install necessary libraries
# pip install jieba rank_bm25

from langchain.schema import Document
from langchain_community.retrievers import BM25Retriever

import jieba


# BM25Retriever's default tokenizer targets English; pass a Chinese tokenizer
# to avoid inaccurate rankings caused by bad word segmentation
def chinese_tokenizer(text):
    return list(jieba.cut(text))


# Sample document content
doc_contents = [
    "Artificial intelligence technology is developing rapidly. Machine learning is an important branch of artificial intelligence.",
    "Natural language processing is another important branch of artificial intelligence. Text analysis is a common application of natural language processing.",
    "Technological development has brought many new application areas",
    "This is a bonus document on natural language processing that covers more text analysis.",
]

# Convert the document contents to a list of LangChain Document objects
documents = [Document(page_content=content) for content in doc_contents]

# Create a BM25 retriever; by default k=4 (the 4 most relevant documents
# are returned), here we ask for the top 2
retriever = BM25Retriever.from_documents(documents, k=2, preprocess_func=chinese_tokenizer)

# Define the query
query = "Application of artificial intelligence technology"

# Perform the search
relevant_docs = retriever.invoke(query)

# Print the search results
print(f"For query '{query}', relevant documents retrieved:")
for i, doc in enumerate(relevant_docs):
    print(f"\n--- document {i + 1} ---")
    print(doc.page_content)

Summary

In general, BM25 is a text relevance ranking algorithm that combines simplicity, interpretability, and efficiency.

During the retrieval stage of the RAG system, it can quickly find candidate paragraphs that best match the query from massive documents without relying on training, providing a more focused context for subsequent generation models.

BM25 can be used independently or combined with dense retrieval to build a stronger hybrid retriever.