Is your RAG search too stupid? Use K-Means clustering to "tune" it

Written by Audrey Miles
Updated on: July 8, 2025
Recommendation: improve RAG retrieval efficiency with the help of K-means clustering!

Core content:
1. Master the principles of K-means clustering and its applications in text processing
2. Generate text vectors with the BGE-M3 model and practice K-means clustering
3. Practical cases of using K-means in a RAG system to enhance retrieval diversity and enable intelligent query routing


Article Objective

This article is written for NLP developers and RAG enthusiasts, and aims to help you:

  • Understand the core principles of K-means clustering: starting from the foundations of unsupervised learning, master how the K-means algorithm works and its potential in text processing.
  • Learn to apply K-means to text data: combined with an advanced text embedding model (such as BGE-M3), cluster text vectors effectively and mine the semantic structure of your data.
  • Improve the retrieval capabilities of RAG systems: explore how to use K-means to optimize a Retrieval-Augmented Generation (RAG) system so that it returns more diverse and intelligent results.

Tips:
All the code in this article can be run directly; you only need to prepare the necessary models and dependencies (such as scikit-learn and FlagEmbedding). Hands-on practice is the key to mastering the technique! The code for this article is open-sourced at: https://github.com/li-xiu-qi/XiaokeAILabs/tree/main/datas/test_k_means

Theme

This article focuses on the application of K-means clustering in NLP and RAG systems. The core content includes:

  • The theoretical basis and implementation details of the K-means algorithm.
  • Using the BGE-M3 model to generate high-quality text vectors and clustering them with K-means in practice.
  • Practical case studies of using K-means to enhance retrieval diversity and implement intelligent query routing in a RAG system.

Summary

Retrieval-Augmented Generation (RAG) is a powerful technique, but its retrieval module is often limited by the homogeneity of traditional similarity ranking, which yields results that lack diversity or fail to match user intent. This article proposes a K-means-based solution that optimizes the RAG retrieval process through unsupervised grouping of text data. We first introduce the basic principles and mathematical objective of the K-means algorithm, then show how to vectorize and cluster text with the BGE-M3 embedding model, and finally demonstrate the effect through two practical RAG cases (diversity-enhanced retrieval and cluster-aware query routing). Whether you want to sharpen your clustering skills or optimize the performance of a RAG system, this article provides systematic theory and practical guidance to help you go further in NLP.

Let’s get started!



Preface 

In natural language processing (NLP), retrieval-augmented generation (RAG) has rapidly emerged in recent years as an important tool for knowledge-intensive tasks. By combining external knowledge retrieval with generative models, RAG can provide more accurate and contextual content when answering complex questions. However, many developers hit a common pain point when using RAG: the retrieval results are too uniform or not "smart" enough, so the generated content lacks diversity or fails to fully cover user needs.

Why does this happen? Traditional retrieval is usually based on simple similarity ranking, such as cosine similarity or Euclidean distance. Although simple and efficient, it tends to return documents that are semantically very close to each other, which in some scenarios leads to information redundancy and fails to capture the diversity or latent semantic structure of the data. The K-means clustering algorithm, a classic unsupervised learning method, can help us "tune" the retrieval process: by grouping the data and mining its semantics, it can significantly improve the diversity and intelligence of retrieval.

This article will explore how to use K-means clustering to optimize the retrieval capabilities of the RAG system. We will start with the basic principles of K-means, gradually explain how to apply it to text vector clustering, and finally show its practical application in the RAG system. Whether you are an NLP developer or a RAG enthusiast, this article will provide you with practical technical insights and actionable code examples to help you build more powerful intelligent systems.


1. Basic principles of K-means clustering

Supervised Learning vs Unsupervised Learning 

Before we dive into the K-means algorithm, let us first understand the difference between supervised and unsupervised learning:

Supervised Learning :

  • The algorithm is trained using data with labels (correct answers)
  • The model learns the mapping between input data and output labels
  • The goal is to predict the label or value of new data
  • Typical algorithms: classification (such as decision trees, support vector machines) and regression (such as linear regression)
  • Applications: spam detection, image recognition, stock price prediction

Unsupervised Learning :

  • The data processed by the algorithm has no labels
  • Models learn by identifying inherent structures and patterns in data
  • The goal is to discover hidden patterns or groups in the data
  • Typical algorithms: clustering (such as k-means, DBSCAN) and dimensionality reduction (such as PCA)
  • Applications: Customer segmentation, anomaly detection, topic discovery

K-means is an unsupervised learning algorithm that aims to divide the dataset into $K$ clusters so that each data point is as close as possible to its nearest cluster center. Its operation can be broken down into the following detailed steps:

Detailed explanation of K-means algorithm steps 

K-means is a classic clustering algorithm whose goal is to divide the dataset into $K$ clusters so that the sum of squared distances from each data point to the center of its cluster (the within-cluster sum of squares, WCSS) is minimized. Mathematically, the optimization objective of K-means can be expressed as:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$$

where:

  • $J$ is the objective function, the total sum of squared distances from all data points to their cluster centers;
  • $K$ is the preset number of clusters;
  • $C_j$ is the set of data points in the $j$-th cluster;
  • $x_i$ is a data point (usually a vector);
  • $\mu_j$ is the center vector of the $j$-th cluster;
  • $\| x_i - \mu_j \|^2$ is the squared Euclidean distance from data point $x_i$ to cluster center $\mu_j$.

The algorithm approaches a local optimum of $J$ step by step through the following stages.

1️⃣ Initialization: Select the initial cluster center

The goal of the initialization phase is to select $K$ initial cluster centers. The choice of initial centers has a significant impact on the final clustering result and the convergence speed.

  • Random selection: randomly select $K$ points from the dataset as the initial cluster centers.

    import numpy as np

    # data is a dataset with shape (n_samples, n_features)
    # k is the number of clusters
    initial_centers = data[np.random.choice(data.shape[0], k, replace=False)]
  • K-means++ initialization: to avoid the local-optimum problem caused by purely random initialization, K-means++ selects more dispersed initial centers through the following steps:

  1. Randomly select the first center from the dataset;
  2. For each data point $x$, compute the distance $D(x)$ to the nearest already-chosen center;
  3. Select the next center with probability proportional to $D(x)^2$: the farther a point is from the existing centers, the more likely it is to be chosen;
  4. Repeat steps 2-3 until $K$ centers have been selected.

The advantage of K-means++ is that the initial centers are spread more evenly, which effectively improves the chance of the algorithm finding a solution close to the global optimum. A minimal sketch of the procedure is shown below.
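Here is a minimal, self-contained sketch of K-means++ initialization (the helper name kmeans_pp_init is illustrative, not the scikit-learn implementation):

import numpy as np

def kmeans_pp_init(data, k, seed=None):
    # Illustrative sketch of K-means++ seeding; data: (n_samples, n_features)
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Step 1: pick the first center uniformly at random
    centers = [data[rng.integers(n)]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen center
        dists = np.min(
            np.sum((data[:, np.newaxis, :] - np.array(centers)) ** 2, axis=2),
            axis=1,
        )
        # Step 3: sample the next center with probability proportional to D(x)^2
        probs = dists / dists.sum()
        centers.append(data[rng.choice(n, p=probs)])
    return np.array(centers)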

2️⃣ Assignment: Assign data points to the nearest cluster

In the assignment phase, each data point $x_i$ is assigned to the cluster whose center is closest to it. The distance is usually the squared Euclidean distance, and the assignment rule is:

$$c_i = \arg\min_{j} \| x_i - \mu_j \|^2$$

where:

  • $x_i$ is the $i$-th data point;
  • $\mu_j$ is the center of the $j$-th cluster;
  • $c_i$ is the index of the cluster that minimizes the distance.

Implementation code:

import numpy as np

def assign_to_clusters(data, centers):
    # data: data matrix, shape (n_samples, n_features)
    # centers: cluster center matrix, shape (k, n_features)
    distances = np.sum((data[:, np.newaxis, :] - centers) ** 2, axis=2)  # (n_samples, k)
    cluster_labels = np.argmin(distances, axis=1)  # cluster index for each data point
    return cluster_labels

This code uses NumPy broadcasting to efficiently compute the squared distances from all data points to all centers, then assigns each data point the index of its nearest cluster.

3️⃣ Update: Recalculate cluster centers

In the update phase, the center of each cluster is recalculated from the current assignments. The new cluster center is the mean of all data points in the cluster:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$

where:

  • $|C_j|$ is the number of data points in the $j$-th cluster;
  • $\sum_{x_i \in C_j} x_i$ is the vector sum of all data points in the cluster.

Implementation code:

def update_centers(data, cluster_labels, k):
    # data: data matrix, shape (n_samples, n_features)
    # cluster_labels: cluster index array, shape (n_samples,)
    # k: number of clusters
    centers = np.zeros((k, data.shape[1]))
    for j in range(k):
        points_in_cluster = data[cluster_labels == j]
        if len(points_in_cluster) > 0:  # avoid empty clusters
            centers[j] = np.mean(points_in_cluster, axis=0)
    return centers

The updated center $\mu_j$ is the centroid of the data points in the cluster, which minimizes the within-cluster sum of squared distances. This step directly drives the objective function $J$ down.

4️⃣ Iteration: Repeat until convergence

Repeat the assignment and update steps until any of the following conditions is met:

  • The change in cluster centers is smaller than a threshold (e.g. $10^{-4}$);
  • The cluster assignment of data points no longer changes;
  • The maximum number of iterations has been reached.

Complete K-means algorithm implementation:

def kmeans(data, k, max_iters=300, tol=1e-4):
    # Initialize centers by random selection
    centers = data[np.random.choice(data.shape[0], k, replace=False)]

    for _ in range(max_iters):
        # Assign data points to clusters
        cluster_labels = assign_to_clusters(data, centers)
        # Update cluster centers
        new_centers = update_centers(data, cluster_labels, k)
        # Check convergence
        if np.sum((new_centers - centers) ** 2) < tol:
            break
        centers = new_centers

    return cluster_labels, centers

Each iteration decreases the objective function $J$ monotonically, and the algorithm eventually converges to a local optimum.

5️⃣ Time complexity

  • Per-iteration complexity: $O(nkd)$, where $n$ is the number of data points, $k$ is the number of clusters, and $d$ is the data dimension;
  • Total complexity: $O(nkdt)$, where $t$ is the number of iterations, which is usually small.
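When $n$ is very large, even a single full pass over the data becomes expensive. A common mitigation is mini-batch K-means, which updates centers from small random batches at a modest cost in accuracy; scikit-learn ships this as MiniBatchKMeans. A minimal sketch on placeholder data:

from sklearn.cluster import MiniBatchKMeans
import numpy as np

data = np.random.rand(100_000, 64)  # placeholder: 100k points, 64 dimensions
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(data)  # far faster than full-batch Lloyd at this scale
print(labels.shape)  # (100000,)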

Detailed explanation of K-means algorithm parameters 

In scikit-learn's KMeans class, the algorithm's behavior can be adjusted through the following parameters:

  • n_clusters: the number of clusters $K$; choose it according to data characteristics or the elbow rule (see the sketch after the sample code below).
  • init: initialization method, default 'k-means++', optionally 'random' or a custom array.
  • n_init: number of runs with different initializations, default 10; the run with the lowest inertia (WCSS) is kept.
  • max_iter: maximum number of iterations, default 300.
  • tol: convergence threshold, default 1e-4.
  • random_state: random seed, for reproducibility.
  • algorithm: algorithm variant, default 'lloyd', optionally 'elkan' (suited to low-dimensional data).

Sample code:

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=5,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=1e-4,
    random_state=42,
    algorithm='lloyd'
)
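As promised above, here is a minimal sketch of the elbow rule for choosing n_clusters: run K-means over a range of K values, record the WCSS (exposed by scikit-learn as inertia_), and look for the "elbow" where the curve flattens. The data here is random placeholder input; substitute your own vectors.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.random.rand(300, 2)  # placeholder data; replace with your own vectors

wcss = []
k_range = range(1, 11)
for k in k_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(data)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(list(k_range), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow method for choosing K')
plt.show()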

Mathematical goal of K-means 

The core of K-means is to optimize the objective function $J$:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$$

The algorithm alternates between the assignment step (fix the centers $\mu_j$, optimize the assignments $c_i$) and the update step (fix the assignments $c_i$, optimize the centers $\mu_j$), gradually decreasing $J$ until convergence. Running the algorithm multiple times and keeping the result with the lowest $J$ alleviates the local-optimum problem caused by the initial center selection.

Usage Examples 

The following is an example of clustering two-dimensional data using the above K-means implementation, including data generation, visualization, and presentation of clustering results.

import numpy as np
import matplotlib.pyplot as plt

# Define K-means functions
def assign_to_clusters(data, centers):
    distances = np.sum((data[:, np.newaxis, :] - centers) ** 2, axis=2)
    return np.argmin(distances, axis=1)

def update_centers(data, cluster_labels, k):
    centers = np.zeros((k, data.shape[1]))
    for j in range(k):
        points_in_cluster = data[cluster_labels == j]
        if len(points_in_cluster) > 0:
            centers[j] = np.mean(points_in_cluster, axis=0)
    return centers

def kmeans(data, k, max_iters=300, tol=1e-4):
    centers = data[np.random.choice(data.shape[0], k, replace=False)]
    for _ in range(max_iters):
        cluster_labels = assign_to_clusters(data, centers)
        new_centers = update_centers(data, cluster_labels, k)
        if np.sum((new_centers - centers) ** 2) < tol:
            break
        centers = new_centers
    return cluster_labels, centers

# Generate sample data: three 2D Gaussian blobs
np.random.seed(42)
data1 = np.random.normal(0, 1, (100, 2)) + [2, 2]
data2 = np.random.normal(0, 1, (100, 2)) + [-2, -2]
data3 = np.random.normal(0, 1, (100, 2)) + [2, -2]
data = np.vstack([data1, data2, data3])

# Run K-means
k = 3
cluster_labels, centers = kmeans(data, k)

# Visualize the results
plt.scatter(data[:, 0], data[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='x', label='Centers')
plt.title('K-means Clustering Result (k=3)')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

This example generates three sets of 2D normally distributed data, runs the K-means algorithm to divide the data into 3 clusters, and uses Matplotlib to visualize the clustering results, with the cluster centers marked with red "x".

2. Text vector generation and clustering practice

Before starting the hands-on practice, install the required dependencies:

pip install numpy scikit-learn pandas matplotlib FlagEmbedding

2.1 Generate text vectors using the BGE-M3 model

BGE-M3 is a powerful embedding model that can generate dense vectors, sparse (lexical weight) vectors, and ColBERT vectors (contextualized late-interaction vectors) for text. Let's understand its role through a concrete example.

Suppose we have the following input sentences:

  • "What is BGE M3?"
  • "Definition of BM25"

After deploying the BGE-M3 model locally, we can extract the vectors with the following code:

from FlagEmbedding import BGEM3FlagModel
import numpy as np

model_path = r"C:\Users\k\Desktop\BaiduSyncdisk\baidu_sync_documents\hf_models\bge-m3"
model = BGEM3FlagModel(model_path, use_fp16=True)

sentences = ["What is BGE M3?", "Definition of BM25"]
output = model.encode(
    sentences=sentences,
    batch_size=12,
    max_length=8192,
    return_dense=True,
    return_sparse=False,
    return_colbert_vecs=False
)

dense_vecs = output['dense_vecs']  # shape: (2, 1024)
print(dense_vecs.shape)  # Output: (2, 1024)

After running, the output shows:

  • dense_vecs has shape (2, 1024), meaning each of the two sentences is represented by a 1024-dimensional dense vector.
  • Each sentence's vector can be extracted via dense_vecs[0] and dense_vecs[1], with shape (1024,).

These vectors capture the semantic information of sentences and provide high-quality input for k-means clustering. The BGE-M3 model has been trained with a large amount of text data and can map semantically similar texts to similar locations in the vector space, which is crucial for subsequent clustering tasks. High-quality text vectors are the basis of clustering effects and can help the algorithm better identify the intrinsic connections between texts.
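As a quick sanity check, a small sketch reusing dense_vecs from the code above compares the two sentence vectors; semantically related pairs should score noticeably higher than unrelated ones:

from sklearn.metrics.pairwise import cosine_similarity

# Compare the two sentence vectors produced above
sim = cosine_similarity([dense_vecs[0]], [dense_vecs[1]])[0][0]
print(f"Cosine similarity between the two sentences: {sim:.4f}")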

2.2 Applying k-means to sentence vectors

Now that we have vector representations, we can feed them into the K-means algorithm. Suppose we want to divide the sentences into two clusters ($K = 2$); this can be implemented with Python's scikit-learn library:

from sklearn.cluster import KMeans

# Input the dense vectors generated by BGE-M3
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(dense_vecs)

print(f"Clustering results: {clusters}")  # Output example: Clustering results: [0 1]

After running, assume the result is [0, 1], indicating that the first sentence belongs to cluster 0 and the second to cluster 1. The cluster centers can be obtained via kmeans.cluster_centers_, with shape (2, 1024). These centers can be regarded as representative vectors of each cluster and used to understand its semantic meaning. In practice, we can analyze the center vector of each cluster and find the original text most similar to it to understand the topic the cluster represents, as sketched below.
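A minimal sketch of that idea, reusing kmeans, clusters, dense_vecs, and sentences from the code above (an illustration, not a production routine):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# For each cluster, find the member sentence closest to the cluster center
for cluster_id, center in enumerate(kmeans.cluster_centers_):
    sims = cosine_similarity([center], dense_vecs)[0]
    member_mask = clusters == cluster_id
    sims[~member_mask] = -np.inf  # only consider sentences assigned to this cluster
    representative = int(np.argmax(sims))
    print(f"Cluster {cluster_id} representative: {sentences[representative]}")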

2.3 Cosine Similarity and K-means Clustering

scikit-learn's K-means uses Euclidean distance by default, but cosine similarity is often more appropriate when dealing with text vectors. We can make standard K-means equivalent to using cosine similarity by normalizing the vectors:

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Normalize the vectors (L2 norm)
normalized_vecs = normalize(dense_vecs, norm='l2')

# Apply K-means on the normalized vectors
cos_kmeans = KMeans(n_clusters=2, random_state=42)
cos_clusters = cos_kmeans.fit_predict(normalized_vecs)

print(f"Clustering results based on cosine similarity: {cos_clusters}")

Principle explanation: when the vectors are normalized, they all lie on the unit hypersphere. In this case there is a direct relationship between the squared Euclidean distance and the cosine similarity:

$$\| a - b \|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2\,(1 - \cos(a, b))$$

Therefore, minimizing the Euclidean distance on normalized vectors is equivalent to maximizing the cosine similarity. This trick lets us use standard K-means to implement clustering based on cosine similarity. Cosine similarity measures the angle between two vectors and focuses on the semantic direction of the text rather than the magnitude of the vector, which usually matches our needs better in text clustering; in particular, when texts vary in length, cosine similarity captures topic information more reliably.
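You can verify this identity numerically with two random unit vectors (a quick sketch):

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # project onto the unit sphere

lhs = np.sum((a - b) ** 2)    # squared Euclidean distance
rhs = 2 * (1 - np.dot(a, b))  # 2 * (1 - cosine similarity)
print(np.isclose(lhs, rhs))   # True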

2.4 Complete Case: News Headline Clustering

Let's walk through a complete example that combines BGE-M3 and K-means in a practical application. This example clusters 20 news headlines spanning different topics, with the hope that the algorithm can identify the latent topic groups.

# Author: Xiao Ke
# Date: March 30, 2025
# Copyright (c) 2025 Xiaoke & Xiaoke AI Research Institute. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from  FlagEmbedding  import  BGEM3FlagModel
import  numpy  as  np
from  sklearn.cluster  import  KMeans
from  sklearn.preprocessing  import  normalize
from  sklearn.metrics  import  silhouette_score
import  pandas  as  pd
import  matplotlib.pyplot  as  plt
from  sklearn.decomposition  import  PCA

# Load the model
model_path =  r"C:\Users\k\Desktop\BaiduSyncdisk\baidu_sync_documents\hf_models\bge-m3"
model = BGEM3FlagModel(model_path, use_fp16= True )

# A collection of sample news headlines (covering five topics: technology, sports, politics, entertainment, and health)
news_titles = [
    # Technology topics
    "Apple releases the latest iPhone 15 series, equipped with A17 chip",
    "Google launches new generation of AI assistant, supports natural language understanding",
    "Tesla's autonomous driving technology has achieved a breakthrough, reducing the accident rate by 30%",
    "Microsoft announces acquisition of AI startup to strengthen cloud service capabilities",

    # Sports topics
    "Messi scored in his Paris debut and the fans cheered",
    "Tokyo Olympics closed, China ranked second in gold medal list",
    "NBA Playoffs: Lakers beat Heat to win championship",
    "The Chinese national football team lost to the Japanese team in the World Cup qualifiers, and the situation of qualifying is grim",

    # Political topics
    "Chinese and US heads of state hold phone call to exchange views on bilateral relations",
    "EU passes new climate law, pledges to achieve carbon neutrality by 2050",
    "UN General Assembly convenes, world leaders discuss global governance",
    "Britain announces new trade policy after Brexit, strengthening cooperation with Asia",

    # Entertainment topics
    "The new movie Dune is a global hit, with box office revenue exceeding $400 million",
    "Pop singer Taylor Swift releases new album, fans are excited",
    "Netflix hit series 'Squid Game' sets viewership record",
    "Oscars ceremony held, Nomadland wins Best Picture",

    # Health topics
    "New study finds regular exercise may reduce Alzheimer's risk",
    "Global COVID-19 vaccination exceeds 3 billion doses, coverage in developing countries is still low",
    "Medical experts recommend reducing the intake of ultra-processed foods to reduce the risk of chronic diseases",
    "Mental health issues on the rise among young people, experts urge greater attention"
]

# Generate text vectors
news_vectors = model.encode(news_titles)['dense_vecs']

# Normalize the vectors to prepare for clustering based on cosine similarity
normalized_vectors = normalize(news_vectors, norm='l2')

# K-means clustering on the normalized vectors (equivalent to cosine similarity)
kmeans = KMeans(
    n_clusters=5,
    init='k-means++',
    n_init=10,
    random_state=42
)
clusters = kmeans.fit_predict(normalized_vectors)

# Create the results DataFrame
results_df = pd.DataFrame({
    'title': news_titles,
    'cluster': clusters
})

# Print the news headlines in each cluster
for cluster_id in range(5):
    print(f"\n=== Cluster {cluster_id} ===")
    cluster_titles = results_df[results_df['cluster'] == cluster_id]['title']
    for title in cluster_titles:
        print(f"- {title}")

# Compute the clustering evaluation metric
silhouette_avg = silhouette_score(normalized_vectors, clusters)
print(f"\nClustering silhouette coefficient: {silhouette_avg:.4f}")

# Visualize clustering results
# Configure fonts so Chinese labels display correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Reduce dimensionality with PCA for visualization
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(news_vectors)

# Set the plotting style
plt.figure(figsize=(10, 8))
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0']
markers = ['o', 's', '^', 'D', 'P']

# Plot data points
for i in range(5):
    # Get the points of the current cluster
    cluster_points = reduced_vectors[clusters == i]
    # Draw all points of this cluster
    plt.scatter(
        cluster_points[:, 0],
        cluster_points[:, 1],
        c=colors[i],
        marker=markers[i],
        label=f'Topic {i}',
        s=100,
        edgecolors='black'
    )

# Plot cluster centers
centers_reduced = pca.transform(kmeans.cluster_centers_)
plt.scatter(
    centers_reduced[:, 0],
    centers_reduced[:, 1],
    c='black',
    marker='X',
    s=200,
    label='Cluster Center'
)

# Add legend and title
plt.legend(fontsize=12)
plt.title('News title clustering results (PCA reduced to 2 dimensions)', fontsize=16)
plt.tight_layout()
plt.savefig('news_clusters_visualization.png', dpi=300)
plt.show()

Output:

=== Cluster 0 ===
- Tokyo Olympics closed, China ranked second in gold medal list
- China's national football team lost to Japan in the World Cup qualifiers, and the situation of qualifying is grim

=== Cluster 1 ===
- Messi scored on his debut in Paris and the fans cheered
- NBA playoffs: Lakers beat Heat to win championship
- The new movie "Dune" is a global hit, with box office exceeding $400 million
- Pop singer Taylor Swift releases new album, fans are enthusiastic
- Netflix hit series "Squid Game" sets viewership record
- Oscars ceremony held, "Nomadland" won the Best Picture

=== Cluster 2 ===
- Apple releases the latest iPhone 15 series, equipped with A17 chip
- Google launches a new generation of artificial intelligence assistant that supports natural language understanding
- Tesla's autonomous driving technology has achieved a breakthrough, reducing the accident rate by 30%
- Microsoft announces acquisition of AI startup to strengthen cloud service capabilities
- EU passes new climate law, commits to achieving carbon neutrality by 2050
- The UK announced a new trade policy after Brexit to strengthen cooperation with Asia

=== Cluster 3 ===
- Chinese and US heads of state hold phone call to exchange views on bilateral relations
- The United Nations General Assembly convenes, and leaders from various countries discuss global governance
- Global COVID-19 vaccination exceeds 3 billion doses, coverage in developing countries is still low

=== Cluster 4 ===
- New study finds regular exercise may reduce risk of Alzheimer's disease
- Medical experts recommend reducing the intake of ultra-processed foods to reduce the risk of chronic diseases
- Mental health issues on the rise among young people, experts call for more attention

Cluster silhouette coefficient: 0.0564

(Figure: PCA visualization of the news headline clustering results)

Result analysis and application

This example shows how to convert text data into vector representations and apply K-means for cluster analysis. Even though no label information was provided in advance, the algorithm grouped the news headlines largely along the five intended themes of technology, sports, politics, entertainment, and health, although a few headlines landed in neighboring clusters (for example, sports and entertainment items sharing a cluster). The silhouette coefficient is low, likely because the headlines are short and the topics are partly correlated, but judging from the actual groupings, the algorithm has captured the semantic similarity to a useful degree.
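If you want to sanity-check the choice of k=5 here, a small sketch reusing normalized_vectors from the code above scans several candidate K values and compares their silhouette coefficients (higher is better):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare silhouette coefficients for a few candidate K values
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(normalized_vectors)
    score = silhouette_score(normalized_vectors, labels)
    print(f"K={k}: silhouette coefficient = {score:.4f}")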

This clustering method has many uses in real-world applications:

  1. Content recommendation system : Based on the cluster to which the content the user is reading belongs, recommend other articles in the same cluster. For example, if the user is reading a news article about iPhone 15, the system can recommend other articles about technology companies launching new products.
  2. Automatic filing and organization : Automatically generate subject categories for large amounts of documents or news, making it easier for users to find and manage information. For example, daily news can be automatically filed according to themes such as politics, economy, and sports.
  3. Hot topic discovery : Identify hot topics by analyzing which clusters contain more recent articles. For example, in social media data analysis, you can find the hot topics that are currently being discussed by users.
  4. Public opinion analysis : Analyze the sentiment and quantity changes of different topic clusters to understand the public's views and attention on different topics. For example, you can analyze user comments on different political topics to determine the direction of public opinion.
  5. Text summary : The most representative text can be selected from each cluster as the summary of the cluster to help users quickly understand the main content of the cluster.
  6. Constructing a knowledge graph : Each cluster can be regarded as a concept, and the text within the cluster can be regarded as an instance of the concept, thus assisting in constructing a knowledge graph.

In the following RAG practical section, we will explore how to apply this clustering technology to build a smarter retrieval enhancement generation system.

3. RAG Practice: The Magic of K-means

3.1 Diversity-enhanced retrieval methods 

# Author: Xiao Ke
# Date: March 30, 2025
# Copyright (c) 2025 Xiaoke & Xiaoke AI Research Institute. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from  FlagEmbedding  import  BGEM3FlagModel
import  numpy  as  np
from  sklearn.cluster  import  KMeans
from  sklearn.metrics.pairwise  import  cosine_similarity

# Load the model
# Use the BGE-M3 model for text vectorization, which supports multi-language text encoding
model_path = r"C:\Users\k\Desktop\BaiduSyncdisk\baidu_sync_documents\hf_models\bge-m3"
model = BGEM3FlagModel(model_path, use_fp16=True)  # Use FP16 to accelerate inference

def diversity_enhanced_retrieval(query, doc_vectors, doc_texts, top_k=5, diversity_clusters=3):
    """
    Return diverse search results, covering different semantic clusters.
    This method ensures diversity of results while maintaining relevance through K-means clustering.

    Args:
        query (str): query text
        doc_vectors (np.ndarray): document vector collection
        doc_texts (list): document text list
        top_k (int): number of documents returned
        diversity_clusters (int): number of clusters, used to increase the diversity of results

    Returns:
        list: selected document index list
    """

    # Encode the query and compute similarities
    # Convert the query text into a vector representation
    query_vec = model.encode([query])['dense_vecs'][0]
    # Compute cosine similarity between the query vector and all document vectors
    similarities = cosine_similarity([query_vec], doc_vectors)[0]

    # Get candidate documents
    # Select the top_n documents with the highest similarity as the candidate set
    top_n = min(top_k * 3, len(doc_vectors))  # take 3x top_k, or all documents
    candidate_indices = np.argsort(similarities)[-top_n:][::-1]  # sort by similarity, descending
    candidate_vectors = doc_vectors[candidate_indices]  # candidate document vectors

    # Perform K-means clustering
    # Cluster candidate documents to ensure semantic diversity
    kmeans = KMeans(n_clusters=min(diversity_clusters, len(candidate_vectors)),
                    random_state=42, n_init=10)  # set clustering parameters
    clusters = kmeans.fit_predict(candidate_vectors)  # cluster and predict labels

    # Select the most similar document from each cluster
    selected_indices = []
    cluster_dict = {}

    # Group by cluster and record similarities
    # For each document, save its original index and similarity under its cluster ID
    for idx, cluster_id in enumerate(clusters):
        cluster_dict.setdefault(cluster_id, []).append((candidate_indices[idx], similarities[candidate_indices[idx]]))

    # Select the best document from each cluster
    # For each cluster, pick the document with the highest similarity
    for cluster_id in range(min(diversity_clusters, len(cluster_dict))):
        if cluster_dict.get(cluster_id):
            best_doc = max(cluster_dict[cluster_id], key=lambda x: x[1])[0]
            selected_indices.append(best_doc)

    # Top up if not enough documents were selected
    # If fewer than top_k documents came from the clusters, add more from the remaining candidates
    remaining = [i for i in candidate_indices if i not in selected_indices]
    if len(selected_indices) < top_k and remaining:
        remaining_similarities = [similarities[i] for i in remaining]
        extra_indices = [remaining[i] for i in np.argsort(remaining_similarities)[-(top_k - len(selected_indices)):]]
        selected_indices.extend(extra_indices)

    return selected_indices[:top_k]

def generate_sample_news_data(n_docs=20):
    """
    Generate simulated news headlines and their vector representations

    Args:
        n_docs (int): the number of documents to generate

    Returns:
        tuple: (document vector array, document text list)
    """

    # Example news headlines
    news_titles = [
        "Artificial intelligence has made a major breakthrough in medical research",
        "Emerging tech startups raise billions in funding",
        "Climate change affects global agricultural development",
        "Quantum computing technology will make new progress in 2025",
        "Artificial intelligence model predicts stock market trends",
        "Renewable energy outpaces traditional fossil fuels",
        "New breakthrough in cancer treatment thanks to AI technology",
        "Tech giants face new privacy regulations",
        "Global data breaches hit record highs",
        "AI assistants become more humane",
        "Self-driving cars revolutionizing transportation",
        "Climate technology receives huge investment support",
        "New AI algorithm solves complex problems",
        "Cybersecurity threats rise in the digital age",
        "Medical AI reduces diagnostic errors",
        "Technological innovation drives economic growth",
        "AI-driven robots enter the workplace",
        "Sustainable technology solutions gain attention",
        "Data science changes business strategies",
        "Quantum AI research opens new frontiers"
    ]

    # If more documents are needed, repeat the title list
    news_titles = news_titles[:n_docs] if len(news_titles) >= n_docs else news_titles * (n_docs // len(news_titles) + 1)
    news_titles = news_titles[:n_docs]

    # Generate vectors
    # Use the model to convert text into vector representations
    news_vectors = model.encode(news_titles)['dense_vecs']
    return news_vectors, news_titles

# Test code
if __name__ == "__main__":
    # Generate test data
    doc_vectors, doc_texts = generate_sample_news_data(n_docs=20)

    # Test query
    query = "Progress in artificial intelligence research"
    result_indices = diversity_enhanced_retrieval(query, doc_vectors, doc_texts, top_k=5, diversity_clusters=3)

    # Print results
    print("Query:", query)
    print("\nSearch results:")
    # Compute similarity between the query and all documents
    similarities = cosine_similarity([model.encode([query])['dense_vecs'][0]], doc_vectors)[0]

    # Sort the selected results by similarity in descending order
    sorted_results = sorted([(idx, similarities[idx]) for idx in result_indices],
                            key=lambda x: x[1], reverse=True)

    # Print the sorted results
    for idx, sim in sorted_results:
        print(f"Document {idx}: {doc_texts[idx]} (similarity: {sim:.4f})")

Output:


Query: Progress in Artificial Intelligence Research

Search Results:
Document 0: Artificial intelligence has made significant breakthroughs in medical research (Similarity: 0.7393)
Document 12: New AI algorithm solves complex problems (Similarity: 0.6835)
Document 6: New breakthrough in cancer treatment thanks to AI technology (Similarity: 0.6642)
Document 10: Self-driving cars revolutionize transportation (Similarity: 0.5270)
Document 7: Tech giants face new privacy regulations (Similarity: 0.5037)

This diverse retrieval method can ensure that the RAG system obtains information from different angles, avoids overly simplistic answers, and greatly improves the comprehensiveness of the generated content. In traditional similarity-based retrieval, the returned results may be highly similar, resulting in information redundancy. By introducing K-means clustering, we can divide the retrieval results into several different semantic clusters and select the most representative documents from each cluster, thereby ensuring the diversity of the retrieval results and enabling the RAG system to more comprehensively understand user queries and generate richer answers. This is especially effective for questions that require multi-faceted information.

3.2 Cluster-aware query routing system 

Complex RAG systems often contain multiple knowledge sources. Clustering-based routing can intelligently select the most relevant knowledge base:

# Author: Xiao Ke
# Date: March 30, 2025
# Copyright (c) 2025 Xiaoke & Xiaoke AI Research Institute. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from  FlagEmbedding  import  BGEM3FlagModel
import  numpy  as  np
from  sklearn.cluster  import  KMeans
from  sklearn.metrics.pairwise  import  cosine_similarity

# Initialize the BGE-M3 model
model_path = r"C:\Users\k\Desktop\BaiduSyncdisk\baidu_sync_documents\hf_models\bge-m3"
model = BGEM3FlagModel(model_path, use_fp16=True)

class ClusterAwareRouter:
    """A clustering-based query routing system that directs queries to the most relevant knowledge base"""

    def __init__(self, knowledge_bases, n_clusters=5):
        """
        Initialize the routing system

        Args:
            knowledge_bases: dict {knowledge base name: {"vectors": document vectors, "documents": documents}}
            n_clusters: the number of clusters per knowledge base, default 5
        """

        self.knowledge_bases = knowledge_bases
        self.kb_centers = {}
        self.n_clusters = n_clusters

        # Build a clustering model for each knowledge base
        for kb_name, kb_data in knowledge_bases.items():
            if len(kb_data["vectors"]) < n_clusters:
                raise ValueError(f"The number of vectors in knowledge base {kb_name} is less than the specified number of clusters {n_clusters}")

            kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
            kmeans.fit(kb_data["vectors"])
            self.kb_centers[kb_name] = kmeans.cluster_centers_

    def route_query(self, query, top_k=1):
        """
        Route a query to the most relevant knowledge base

        Args:
            query: string, the user's query
            top_k: return the top k most relevant knowledge bases, default 1

        Returns:
            If top_k=1, the best knowledge base name and similarity score (str, float)
            If top_k>1, a list of (knowledge base name, similarity score) pairs sorted by similarity
        """

        # Encode the query as a vector
        query_vec = model.encode([query], max_length=512)['dense_vecs'][0]

        # Compute the maximum similarity between the query and each knowledge base's cluster centers
        similarities = {}
        for kb_name, centers in self.kb_centers.items():
            sim_scores = cosine_similarity([query_vec], centers)[0]
            similarities[kb_name] = np.max(sim_scores)

        # Sort by similarity and select the top_k results
        sorted_kbs = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

        if top_k == 1:
            return sorted_kbs[0][0], sorted_kbs[0][1]  # best knowledge base name and similarity score
        return [(kb[0], kb[1]) for kb in sorted_kbs[:top_k]]  # (name, score) pairs

    def retrieve_documents(self, query, kb_name, top_k=3):
        """
        Retrieve the documents most relevant to the query from the specified knowledge base

        Args:
            query: user query
            kb_name: knowledge base name
            top_k: return the top k most relevant documents, default 3

        Returns:
            A list of (document, similarity_score) pairs
        """

        # Check that the knowledge base exists
        if kb_name not in self.knowledge_bases:
            return []

        # Encode the query as a vector
        query_vec = model.encode([query], max_length=512)['dense_vecs'][0]

        # Get the document vectors and contents of the knowledge base
        kb_vectors = self.knowledge_bases[kb_name]["vectors"]
        kb_documents = self.knowledge_bases[kb_name]["documents"]

        # Compute the similarity between the query and all documents in the knowledge base
        similarities = cosine_similarity([query_vec], kb_vectors)[0]

        # Get the indices of the most similar documents
        top_indices = np.argsort(similarities)[::-1][:top_k]

        # Return the relevant documents and their similarity scores
        return [(kb_documents[i], similarities[i]) for i in top_indices]

# Example usage
if __name__ == "__main__":
    # Create sample knowledge base data
    sample_docs = {
        "Medicine": {
            "documents": [
                "The treatment of diabetes mellitus includes insulin therapy and diet control.",
                "Common symptoms of flu include fever, cough and fatigue.",
                "Heart disease prevention requires regular exercise and a healthy diet."
            ],
            "vectors": None
        },
        "Technology": {
            "documents": [
                "Python is widely used for machine learning and data analysis.",
                "Cloud computing provides a scalable infrastructure solution.",
                "AI models require a lot of computing resources."
            ],
            "vectors": None
        }
    }

    # Generate vector representations
    for kb_name in sample_docs:
        texts = sample_docs[kb_name]["documents"]
        vectors = model.encode(texts, batch_size=3)['dense_vecs']
        sample_docs[kb_name]["vectors"] = vectors

    # Initialize the router
    router = ClusterAwareRouter(sample_docs, n_clusters=2)

    # Test queries
    test_queries = [
        "What are the symptoms of a cold?",
        "What is the best programming language for AI development?"
    ]

    # Execute routing
    for query in test_queries:
        best_kb, similarity = router.route_query(query)  # get the best knowledge base and its similarity
        print(f"Query: {query}")
        print(f"Knowledge base routed to: {best_kb} (similarity: {similarity:.4f})")

        # Retrieve and display relevant documents
        relevant_docs = router.retrieve_documents(query, best_kb)
        print("Search results:")
        for i, (doc, score) in enumerate(relevant_docs, 1):
            print(f"{i}. Document: {doc} (similarity: {score:.4f})")
        print("\n" + "-" * 50 + "\n")

Output:


Query: What are the symptoms of a cold?
Knowledge base routed to: Medicine (similarity: 0.6908)
Search Results:
1. Document: Common symptoms of influenza include fever, cough, and fatigue. (Similarity: 0.6908)
2. Document: Heart disease prevention requires regular exercise and a healthy diet. (Similarity: 0.4443)
3. Document: Treatments for diabetes include insulin therapy and diet control. (Similarity: 0.3520)

--------------------------------------------------

Query: What is the best programming language for AI development?
Knowledge base routed to: Technology (similarity: 0.6068)
Search Results:
1. Documentation: AI models require a lot of computing resources. (Similarity: 0.5529)
2. Documentation: Python is widely used in machine learning and data analysis. (Similarity: 0.5068)
3. Documentation: Cloud computing provides a scalable infrastructure solution. (Similarity: 0.3390)

In a large RAG system containing documents from multiple professional domains, this routing mechanism can significantly improve the professionalism of answers and avoid mixing knowledge across domains. By clustering the documents in each knowledge base, we extract representative semantic clusters for each one. When a user query arrives, the system compares the query vector with each knowledge base's cluster centers and selects the most relevant knowledge base, limiting the search to where the answer is most likely to be found and improving both retrieval efficiency and accuracy. This method is particularly suitable for scenarios with several vertical-domain knowledge bases, such as a question-answering system spanning medical, legal, and technical documents. Of course, ideally no one should cram multiple unrelated fields into a single flat system in the first place...

In fact, my main goal is to give you some inspiration. For example, in multi-turn conversations, you can use clustering to decide which parts of the conversation history to load, which tends to work well; a sketch of this idea follows. K-means has many applications in RAG scenarios, so explore freely, and I believe you will gain a lot.
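Here is an illustrative sketch of that idea; compress_history and its arguments are hypothetical names, and it assumes the same BGE-M3 model object used throughout this article:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def compress_history(history_turns, model, n_clusters=4):
    # Hypothetical sketch: cluster past conversation turns and keep the turn
    # closest to each cluster center as a compact context summary.
    vecs = model.encode(history_turns)['dense_vecs']
    k = min(n_clusters, len(history_turns))
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(vecs)
    kept = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        sims = cosine_similarity([km.cluster_centers_[j]], vecs[idx])[0]
        kept.append(history_turns[idx[int(np.argmax(sims))]])
    return kept  # representative turns to load as conversation context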


Summary and Outlook

Technology Panorama 

  • Unsupervised learning: The k-means clustering algorithm provides an effective way to group data without labels.

  • Text Embedding Model: The BGE-M3 model is able to generate high-quality text vectors and provide powerful feature representations for downstream tasks such as clustering.

  • Retrieval-augmented Generation (RAG): By optimizing the retrieval process, clustering techniques can significantly improve the performance of RAG systems and the diversity of generated content.

Learning Summary 

  • Mastered the basic principles and implementation steps of the k-means clustering algorithm.

  • Learned to generate dense vector representations of text using the BGE-M3 model.

  • Learned how to apply k-means for text clustering in NLP tasks.

  • Explored how to use clustering techniques to enhance the retrieval of a RAG system, including improving the diversity of retrieval results and implementing intelligent query routing.

Hands-on Challenge 

  • Try modifying the n_clusters parameter in the news headline clustering case, observe how the clustering results change, and analyze the impact of different K values on the results.

  • Based on the provided diversity-enhanced retrieval code, try different diversity_clusters values and analyze how the diversity of the search results changes.

  • Expand the cluster-aware query routing system by adding more knowledge bases and designing more complex query scenarios for testing.