Basic use of BM25 algorithm and its application in RAG

Master the BM25 algorithm to improve the performance of information retrieval and question answering systems.
Core content:
1. The principle of the BM25 algorithm and its application in the RAG framework
2. Detailed steps to implement the BM25 algorithm in Python
3. Use the rank_bm25 library to calculate the relevance of documents and queries
The BM25 algorithm is mentioned or used in vector queries in various RAG open source frameworks and vector libraries. This article introduces the basic principles of the BM25 algorithm and uses a runnable example to illustrate how to use the BM25 algorithm.
BM25 (Best Match 25) is a probabilistic model for information retrieval, commonly used in search engines and question-answering systems. In the RAG (Retrieval-Augmented Generation) framework, BM25 is used to retrieve the most relevant documents to a query from a large-scale document library.
BM25 calculates the relevance of documents to queries based on term frequency (TF) and inverse document frequency (IDF), and combines document length normalization to effectively process documents of different lengths.
Basic use of BM25
Below is a complete example of implementing the BM25 algorithm in Python. We will userank_bm25
This is a commonly used BM25 implementation library that can easily calculate the relevance of documents to queries.
Install dependent libraries
pip install rank-bm25
Writing test code
The basic logic of the code is as follows:
(1) Perform word segmentation on each document and question separately
(2) Initialize the BM25 algorithm using the document list after word segmentation
(3) Compare the similarity between the query and the word list
(4) Find the document with the highest relevance score from the document list. This document is the document we are looking for.
The code is implemented as follows:
from rank_bm25 import BM25Okapi
import jieba # Used for Chinese word segmentation
# Example document library
documents = [ "Cats are cute animals that like to catch mice." , "Dogs are good friends of humans and like to chase cats." , "Mice are small rodents that cats like to catch." ] print ( documents ) # User query query = "What animals do cats like to catch?" print ( "Question: " + query ) # Word segmentation for documents and queries def tokenize ( text ): return list ( jieba . cut ( text )) # Document library after word segmentation tokenized_documents = [ tokenize ( doc ) for doc in documents ] # Query after word segmentation tokenized_query = tokenize ( query ) # Initialize BM25 model bm25 = BM25Okapi ( tokenized_documents ) # Calculate the relevance score between query and document doc_scores = bm25 . get_scores ( tokenized_query ) # Print the score of each document for i , score in enumerate ( doc_scores ): print ( f" BM25 score of document { i + 1 } : { score } " ) # Find the most relevant document most_relevant_doc_index = doc_scores . argmax () print ( f"\nThe most relevant document is document { most_relevant_doc_index + 1 } : { documents [ most_relevant_doc_index ]} " )
Code Description
Participle :
use
jieba
The library performs word segmentation on Chinese documents and queries.For example, the word segmentation result of document 1 is:
['Cat', 'is', 'a', 'cute', 'type', 'animal', 'likes', 'catch', 'mice', '.']
Initialize the BM25 model :
use
BM25Okapi
The class initializes the BM25 model and passes in the document library after word segmentation.Calculate the relevance score :
use
get_scores
The method calculates the relevance score of the query to each document.Sorting and output :
Sort by score to find the most relevant documents.
Result Output
[ 'Cats are cute animals that like to catch mice. ' , 'Dogs are good friends of humans and like to chase cats. ' , 'Mice are small rodents that cats like to catch. ' ]
Question : What animals do cats like to catch?
Building prefix dict from the default dictionary ...
Loading model from cache / tmp / jieba . cache
Loading model cost 0.705 seconds .
Prefix dict has been built successfully . BM25 score for
document 1 : 0.3001762708496166 BM25 score for
document 2 : - 0.07080064278072501 BM25 score for
document 3 : - 0.2035654844820229
The most relevant document is Document 1 : Cats are cute animals that like to catch mice.
Application of BM25 in RAG
In the RAG (Retrieval-Augmented Generation) framework, a hybrid retrieval method that combines vector retrieval (such as embedding-based similarity search) and BM25 keyword retrieval can significantly improve the accuracy of queries. The core idea of this hybrid retrieval is to use the complementarity of the two retrieval methods to more comprehensively capture the relevance between queries and documents.
Advantages and disadvantages of vector search and BM25 search
Vector retrieval (embedding-based similarity search)
advantage :
Capable of capturing semantic similarity even when the query and document do not have exact matching keywords.
Suitable for handling complex semantic relationships, such as synonyms, contextual correlation, etc.
shortcoming :
The ability to capture out-of-domain or rare terms is weak.
Requires high-quality embedding models (such as BERT, Sentence-BERT) and a lot of computing resources.
BM25 keyword search
advantage :
It works very well for exact keyword matching and is suitable for handling clear terms and short queries.
High computational efficiency, suitable for large-scale document libraries.
shortcoming :
It cannot capture semantic similarity and is insensitive to synonyms and contextual relevance.
Less effective for long or complex queries.
The principle of hybrid search
The core principle of hybrid search is to combine the advantages of vector search and BM25 search , and generate the final search results through weighted or sorted fusion. The specific steps are as follows:
(1) Calculate the scores of vector search and BM25 search respectively
Vector retrieval : Use an embedding model such as BERT to convert queries and documents into vectors, and then calculate cosine similarity as the score.
BM25 retrieval : Use the BM25 algorithm to calculate the keyword matching score between the query and the document.
(2) Score normalization
Since the score ranges of vector retrieval and BM25 retrieval may be different, the scores need to be normalized to make them comparable on the same dimension.
(3) Weighted fusion
The scores of the two retrieval methods are weighted and combined to generate the final retrieval score. For example:
Final Score=α⋅BM25 Score+(1−α)⋅Vector ScoreFinal Score= α ⋅BM25 Score+(1− α )⋅Vector Score
Here, α α is a weight parameter used to adjust the relative importance of BM25 and vector retrieval.
(4) Sorting and returning results
The documents are sorted according to the final score, returning the most relevant documents.
Hybrid retrieval strategies in practical applications
In practical applications, hybrid retrieval can be achieved through the following strategies:
(1) Weighted summation
The scores of vector search and BM25 search are added together according to certain weights to generate the final score. For example:
Final Score=0.5⋅BM25 Score+0.5⋅Vector ScoreFinal Score=0.5⋅BM25 Score+0.5⋅Vector Score
(2) Sorting Fusion
Sort the results of vector search and BM25 search respectively, and then merge the sorted results by weighting or interpolation. For example:
Take the first 10 results of vector search and the first 10 results of BM25 search, merge them and re-sort them.
(3) Threshold filtering
First, use BM25 search to filter out documents that match the keywords, and then use vector search to semantically sort the filtered documents.
Code implementation of hybrid retrieval
After installing the embedding model through the local ollama, you can use the embedding model to generate embedding vectors. The following code calculates the embedding vector of each document block through the locally deployed embedding model, and then generates the embedding vector according to the mixture.
from rank_bm25 import BM25Okapi
from sklearn . metrics . pairwise import cosine_similarity
import numpy as np
import jieba
import requests # Used to send HTTP requests
# Sample document library
documents = [ "Cats are cute animals and like to catch mice." , "Dogs are good friends of humans and like to chase cats." , "Mice are small rodents that cats like to catch." ] # User query query = "What animals do cats like to catch?" # Word segmentation function def tokenize ( text ): return list ( jieba . cut ( text )) # Document library after word segmentation tokenized_documents = [ tokenize ( doc ) for doc in documents ] # Initialize BM25 model bm25 = BM25Okapi ( tokenized_documents ) # Calculate BM25 score bm25_scores = bm25 . get_scores ( tokenize ( query )) # Embedding model API address EMBEDDING_MODEL_URL = "http://172.16.1.54:11434/api/embeddings" # Function to get the embedding vector def get_embedding ( text ): # Construct request data payload = { "model" : "unclemusclez/jina-embeddings-v2-base-code:latest" , "prompt" : text } # Send POST request response = requests . post ( EMBEDDING_MODEL_URL , json = payload ) if response . status_code == 200 : # Parse the returned embedding vector embedding = response . json (). get ( "embedding" ) return np . array ( embedding ) else : raise Exception ( f"Failed to get embedding: { response . status_code } - { response . text } " )
# Get the embedding vectors of the query and document
query_embedding = get_embedding ( query )
document_embeddings = [ get_embedding ( doc ) for doc in documents ]
# Calculate cosine similarity
vector_scores = cosine_similarity ([ query_embedding ], document_embeddings )[ 0 ]
# Normalized scores
bm25_scores_normalized = ( bm25_scores - np . min ( bm25_scores )) / ( np . max ( bm25_scores ) - np . min ( bm25_scores ))
vector_scores_normalized = ( vector_scores - np . min ( vector_scores )) / ( np . max ( vector_scores ) - np . min ( vector_scores ))
# Weighted fusion
alpha = 0.5
final_scores = alpha * bm25_scores_normalized + ( 1 - alpha ) * vector_scores_normalized
# Sort and output the results
sorted_indices = np . argsort ( final_scores )[:: - 1 ]
for i in sorted_indices : print ( f"Mixed score of document { i + 1 } : { final_scores [ i ] } - { documents [ i ] } " )
Output:
Building prefix dict from the default dictionary ...
Loading model from cache / tmp / jieba . cache
Loading model cost 0.715 seconds .
Prefix dict has been built successfully .Mixed score for
Document 1 : 1.0 - A cat is a cute animal that likes to catch mice. Mixed score for
Document 2 : 0.5282365628163656 - A dog is a good friend of humans and likes to chase cats. Mixed score for
Document 3 : 0.0 - A mouse is a small rodent that cats like to catch.
Note: The scores here are based on my embedding model, and different models may be different.
Summarize
By combining vector retrieval and BM25 retrieval, the hybrid retrieval method can capture semantic information and keyword matching information at the same time, thereby improving the query accuracy of the RAG framework. This method has significant advantages in processing diverse queries and reducing missed detections and false detections, and is an effective strategy to improve the performance of retrieval enhancement generation systems.