Woter AI detection.Hurry - ends Jul 19th

New Year Sales :up to 80% OFF

AI Humanize AI Translator Bypass AI AI Rewriter AI Detector

PRICING

TRY FOR FREE

Basic use of BM25 algorithm and its application in RAG

Written by

Jasper Cole

Updated on:July-15th-2025

Overview

The BM25 algorithm is mentioned or used in vector queries in various RAG open source frameworks and vector libraries. This article introduces the basic principles of the BM25 algorithm and uses a runnable example to illustrate how to use the BM25 algorithm.

BM25 (Best Match 25) is a probabilistic model for information retrieval, commonly used in search engines and question-answering systems. In the RAG (Retrieval-Augmented Generation) framework, BM25 is used to retrieve the most relevant documents to a query from a large-scale document library.

BM25 calculates the relevance of documents to queries based on term frequency (TF) and inverse document frequency (IDF), and combines document length normalization to effectively process documents of different lengths.

Basic use of BM25

Below is a complete example of implementing the BM25 algorithm in Python. We will userank_bm25This is a commonly used BM25 implementation library that can easily calculate the relevance of documents to queries.

Install dependent libraries

pip install rank-bm25

Writing test code

The basic logic of the code is as follows:

(1) Perform word segmentation on each document and question separately

(2) Initialize the BM25 algorithm using the document list after word segmentation

(3) Compare the similarity between the query and the word list

(4) Find the document with the highest relevance score from the document list. This document is the document we are looking for.

The code is implemented as follows:

from rank_bm25 import BM25Okapi    
import jieba # Used for Chinese word segmentation    

# Example document library 
documents =  [  "Cats are cute animals that like to catch mice." , "Dogs are good friends of humans and like to chase cats." , "Mice are small rodents that cats like to catch." ] print ( documents ) # User query query = "What animals do cats like to catch?" print ( "Question: " + query ) # Word segmentation for documents and queries def tokenize ( text ): return list ( jieba . cut ( text )) # Document library after word segmentation tokenized_documents =  [ tokenize ( doc )  for doc in documents ] # Query after word segmentation tokenized_query = tokenize ( query ) # Initialize BM25 model bm25 = BM25Okapi ( tokenized_documents ) # Calculate the relevance score between query and document doc_scores = bm25 . get_scores ( tokenized_query ) # Print the score of each document for i ,  score in enumerate ( doc_scores ): print ( f"  BM25 score of  document { i + 1 } : { score } " ) # Find the most relevant document most_relevant_doc_index = doc_scores . argmax () print ( f"\nThe most relevant document is document { most_relevant_doc_index + 1 } :  { documents [ most_relevant_doc_index ]} " )

Code Description

Participle :

usejiebaThe library performs word segmentation on Chinese documents and queries.
For example, the word segmentation result of document 1 is:['Cat', 'is', 'a', 'cute', 'type', 'animal', 'likes', 'catch', 'mice', '.']

Initialize the BM25 model :

useBM25OkapiThe class initializes the BM25 model and passes in the document library after word segmentation.

Calculate the relevance score :

useget_scoresThe method calculates the relevance score of the query to each document.

Sorting and output :

Sort by score to find the most relevant documents.

Result Output

[ 'Cats are cute animals that like to catch mice. ' ,  'Dogs are good friends of humans and like to chase cats. ' ,  'Mice are small rodents that cats like to catch. ' ] 
Question :  What animals do cats like to catch? 
Building prefix dict from the default dictionary ...        
Loading model from cache / tmp / jieba . cache     
Loading model cost 0.705 seconds .     
Prefix dict has been built successfully .      BM25 score for 
document 1 :  0.3001762708496166    BM25 score for 
document 2 :  - 0.07080064278072501    BM25 score for 
document 3 :  - 0.2035654844820229   

The most relevant document is Document 1 :  Cats are cute animals that like to catch mice.

Application of BM25 in RAG

In the RAG (Retrieval-Augmented Generation) framework, a hybrid retrieval method that combines vector retrieval (such as embedding-based similarity search) and BM25 keyword retrieval can significantly improve the accuracy of queries. The core idea of this hybrid retrieval is to use the complementarity of the two retrieval methods to more comprehensively capture the relevance between queries and documents.

Advantages and disadvantages of vector search and BM25 search

Vector retrieval (embedding-based similarity search)

advantage :

Capable of capturing semantic similarity even when the query and document do not have exact matching keywords.
Suitable for handling complex semantic relationships, such as synonyms, contextual correlation, etc.

shortcoming :

The ability to capture out-of-domain or rare terms is weak.
Requires high-quality embedding models (such as BERT, Sentence-BERT) and a lot of computing resources.

BM25 keyword search

advantage :

It works very well for exact keyword matching and is suitable for handling clear terms and short queries.
High computational efficiency, suitable for large-scale document libraries.

shortcoming :

It cannot capture semantic similarity and is insensitive to synonyms and contextual relevance.
Less effective for long or complex queries.

The principle of hybrid search

The core principle of hybrid search is to combine the advantages of vector search and BM25 search , and generate the final search results through weighted or sorted fusion. The specific steps are as follows:

(1) Calculate the scores of vector search and BM25 search respectively

Vector retrieval : Use an embedding model such as BERT to convert queries and documents into vectors, and then calculate cosine similarity as the score.
BM25 retrieval : Use the BM25 algorithm to calculate the keyword matching score between the query and the document.

(2) Score normalization

Since the score ranges of vector retrieval and BM25 retrieval may be different, the scores need to be normalized to make them comparable on the same dimension.

(3) Weighted fusion

The scores of the two retrieval methods are weighted and combined to generate the final retrieval score. For example:

Final Score=α⋅BM25 Score+(1−α)⋅Vector ScoreFinal Score= α ⋅BM25 Score+(1− α )⋅Vector Score

Here, α α is a weight parameter used to adjust the relative importance of BM25 and vector retrieval.

(4) Sorting and returning results

The documents are sorted according to the final score, returning the most relevant documents.

Hybrid retrieval strategies in practical applications

In practical applications, hybrid retrieval can be achieved through the following strategies:

(1) Weighted summation

The scores of vector search and BM25 search are added together according to certain weights to generate the final score. For example:

Final Score=0.5⋅BM25 Score+0.5⋅Vector ScoreFinal Score=0.5⋅BM25 Score+0.5⋅Vector Score

(2) Sorting Fusion

Sort the results of vector search and BM25 search respectively, and then merge the sorted results by weighting or interpolation. For example:

Take the first 10 results of vector search and the first 10 results of BM25 search, merge them and re-sort them.

(3) Threshold filtering

First, use BM25 search to filter out documents that match the keywords, and then use vector search to semantically sort the filtered documents.

Code implementation of hybrid retrieval

After installing the embedding model through the local ollama, you can use the embedding model to generate embedding vectors. The following code calculates the embedding vector of each document block through the locally deployed embedding model, and then generates the embedding vector according to the mixture.

from rank_bm25 import BM25Okapi    
from sklearn . metrics . pairwise import cosine_similarity    
import numpy as np    
import jieba  
import requests # Used to send HTTP requests    

# Sample document library 
documents =  [  "Cats are cute animals and like to catch mice." , "Dogs are good friends of humans and like to chase cats." , "Mice are small rodents that cats like to catch." ] # User query query = "What animals do cats like to catch?" # Word segmentation function def tokenize ( text ): return list ( jieba . cut ( text )) # Document library after word segmentation tokenized_documents =  [ tokenize ( doc )  for doc in documents ] # Initialize BM25 model bm25 = BM25Okapi ( tokenized_documents ) # Calculate BM25 score bm25_scores = bm25 . get_scores ( tokenize ( query )) # Embedding model API address EMBEDDING_MODEL_URL = "http://172.16.1.54:11434/api/embeddings" # Function to get the embedding vector def get_embedding ( text ): # Construct request data payload =  { "model" :  "unclemusclez/jina-embeddings-v2-base-code:latest" , "prompt" :  text     } # Send POST request response = requests . post ( EMBEDDING_MODEL_URL ,  json = payload ) if response . status_code == 200 : # Parse the returned embedding vector embedding = response . json (). get ( "embedding" ) return np . array ( embedding ) else : raise Exception ( f"Failed to get embedding:  { response . status_code }  -  { response . text } " )
    
    
    



  


 
     


    

  

  

  



 
    
     
        
        

    
      
       
        
          
         
    
         

# Get the embedding vectors of the query and document 
query_embedding = get_embedding ( query )   
document_embeddings =  [ get_embedding ( doc )  for doc in documents ]     
# Calculate cosine similarity 
vector_scores = cosine_similarity ([ query_embedding ],  document_embeddings )[ 0 ]   

# Normalized scores 
bm25_scores_normalized =  ( bm25_scores - np . min ( bm25_scores ))  /  ( np . max ( bm25_scores )  - np . min ( bm25_scores ))     
vector_scores_normalized =  ( vector_scores - np . min ( vector_scores ))  /  ( np . max ( vector_scores )  - np . min ( vector_scores ))     

# Weighted fusion 
alpha = 0.5   
final_scores = alpha * bm25_scores_normalized +  ( 1 - alpha )  * vector_scores_normalized         

# Sort and output the results 
sorted_indices = np . argsort ( final_scores )[:: - 1 ]   
for i in sorted_indices :    print ( f"Mixed score of document { i + 1 }  :  { final_scores [ i ] }  -  { documents [ i ] } " )

Output:

Building prefix dict from the default dictionary ...        
Loading model from cache / tmp / jieba . cache     
Loading model cost 0.715 seconds .     
Prefix dict has been built successfully .Mixed      score for 
Document 1 :  1.0 - A cat is a cute animal that likes to catch mice.    Mixed score for 
Document 2 :  0.5282365628163656 - A dog is a good friend of humans and likes to chase cats.    Mixed score for 
Document 3 :  0.0 - A mouse is a small rodent that cats like to catch.

Note: The scores here are based on my embedding model, and different models may be different.

Summarize

By combining vector retrieval and BM25 retrieval, the hybrid retrieval method can capture semantic information and keyword matching information at the same time, thereby improving the query accuracy of the RAG framework. This method has significant advantages in processing diverse queries and reducing missed detections and false detections, and is an effective strategy to improve the performance of retrieval enhancement generation systems.