RAG Advanced: Embedding Models Principles and Selection

Written by
Clara Bennett
Updated on: June 29, 2025
Recommendation

Master advanced RAG, starting with embedding models.

Core content:
1. Basic concepts and core principles of embedding models
2. Mainstream categories of embedding models and a selection guide
3. Practical case: SQuAD dataset evaluation and code walkthrough


1. Concepts and core principles

1. The Nature of Embedding Models

An embedding model is a technique that maps discrete data (such as text or images) into a continuous vector space. Through high-dimensional vector representations (such as 768 or 3072 dimensions), the model captures the semantic information of the data, so that semantically similar texts lie closer together in the vector space. For example, "forgot password" and "account locked" are encoded as similar vectors, which supports semantic retrieval rather than mere keyword matching.

2. Core roles

Semantic encoding: convert text, images, and other data into vectors while retaining contextual information (for example via BERT's CLS token or mean pooling); see the sketch after this list.

Similarity calculation: measure how related two vectors are using cosine similarity, Euclidean distance, and similar metrics, supporting applications such as retrieval-augmented generation (RAG) and recommendation systems.

Dimensionality reduction: compress complex data into dense, relatively low-dimensional vectors, improving storage and computation efficiency.
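
As a minimal illustration of encoding and similarity (not the article's code), the sketch below mean-pools BERT token embeddings into a single sentence vector and compares two phrases with cosine similarity; the model name bert-base-uncased is only an illustrative choice. Libraries such as sentence-transformers perform this pooling step internally, which is what the later examples rely on.

# Minimal sketch: mean pooling of token embeddings, then cosine similarity
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return token_vectors.mean(dim=0)                          # mean pooling -> (768,)

v1, v2 = embed("forgot password"), embed("account locked")
print(f"cosine similarity: {torch.nn.functional.cosine_similarity(v1, v2, dim=0).item():.4f}")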

3. Key technical principles

Context dependence: modern models (such as BGE-M3) adjust vectors dynamically based on context, capturing the meaning of polysemous words in different sentences.

Training methods: contrastive learning (for example, Word2Vec's Skip-gram/CBOW) and pre-training plus fine-tuning (for example, BERT); a toy loss sketch follows.
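
To make the contrastive-learning idea concrete, here is a toy, NumPy-only sketch of an InfoNCE-style loss: the query embedding is pulled toward its positive passage and pushed away from random negatives. The vectors below are random stand-ins, not real embeddings.

# Toy InfoNCE-style contrastive loss over a query, one positive, and N negatives
import numpy as np

def info_nce(query, positive, negatives, temperature=0.05):
    candidates = np.vstack([positive] + negatives)  # (1 + N, dim)
    # cosine similarity between the query and every candidate
    sims = candidates @ query / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))
    logits = sims / temperature
    # cross-entropy with the positive candidate at index 0
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=768)
positive = query + 0.1 * rng.normal(size=768)      # a vector close to the query
negatives = [rng.normal(size=768) for _ in range(4)]
print(f"toy InfoNCE loss: {info_nce(query, positive, negatives):.4f}")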

2. Mainstream Model Classification and Selection Guide

Embedding models convert text into numerical vectors that capture semantic information, enabling computers to understand and compare the "meaning" of content.

Considerations for selecting an embedding model:

  • Task nature: match the requirements of the task (question answering, search, clustering, etc.)

  • Domain characteristics: general-purpose vs. specialized fields (medicine, law, etc.)

  • Multi-language support: needed when dealing with multilingual content

  • Dimensions: balance information richness against computational cost

  • License terms: open-source models vs. proprietary services

  • Maximum tokens: a context window size that fits your documents
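
Two of these factors, dimensions and maximum tokens, can be read directly off a loaded model. A small sketch with sentence-transformers; the model name all-MiniLM-L6-v2 is only an example:

# Inspect a candidate model's embedding dimension and maximum sequence length
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print("embedding dimension:", model.get_sentence_embedding_dimension())  # e.g. 384
print("max tokens:", model.max_seq_length)                               # e.g. 256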


Best Practice: Test multiple embedding models for your specific application and evaluate their performance on real data rather than relying solely on general benchmarks.

We usually evaluate with SQuAD (the validation set), a standard benchmark for question-answering systems. SQuAD is an English reading-comprehension dataset released by Stanford University, mainly used to train and evaluate models that answer questions based on a given passage. Each sample in the dataset includes:

  • A paragraph (context)

  • A question

  • The correct answer (a span taken from the paragraph)


Download address:

https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
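
For orientation, a single record in dev-v2.0.json has roughly the shape sketched below (the values are illustrative placeholders); the keys shown are the ones the evaluation code relies on:

# Abridged, illustrative sketch of one record in dev-v2.0.json
sample_article = {
    "title": "...",
    "paragraphs": [
        {
            "context": "A paragraph of source text ...",
            "qas": [
                {
                    "question": "A question about the paragraph?",
                    "is_impossible": False,  # True for unanswerable questions in v2.0
                    "answers": [{"text": "a span copied from the context", "answer_start": 0}],
                }
            ],
        }
    ],
}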

We evaluate and compare two embedding models. The code is as follows:

# embedding_model effect comparison
from sentence_transformers import SentenceTransformer, util
import json
import numpy as np

# Load SQuAD data (assuming it has been processed into list format)
with open(r"D:\Test\LLMTrain\day18\data\squad_dev.json") as f:
    squad_data = json.load(f)["data"]

# Extract question-answer pairs
qa_pairs = []
for article in squad_data:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            if not qa["is_impossible"]:
                qa_pairs.append({
                    "question": qa["question"],
                    "answer": qa["answers"][0]["text"],
                    "context": para["context"]
                })

# Initialize two local models
model1 = SentenceTransformer(r'D:\Test\LLMTrain\testllm\llm\sentence-transformers\paraphrase-multilingual-MiniLM-L12-v2')  # Model 1
model2 = SentenceTransformer(r'D:\Test\LLMTrain\testllm\llm\sungw111\text2vec-base-chinese-sentence')  # Model 2

# Encode all contexts (as the vector library)
contexts = [item["context"] for item in qa_pairs]
context_embeddings1 = model1.encode(contexts)  # vector library of model 1
context_embeddings2 = model2.encode(contexts)  # vector library of model 2

# Evaluation function
def evaluate(model, query_embeddings, context_embeddings):
    correct = 0
    for idx, qa in enumerate(qa_pairs[:100]):  # test the first 100 items
        # Find the most similar context
        sim_scores = util.cos_sim(query_embeddings[idx], context_embeddings)
        best_match_idx = np.argmax(sim_scores)
        # Check whether the answer appears in the matched paragraph
        if qa["answer"] in contexts[best_match_idx]:
            correct += 1
    return correct / len(qa_pairs[:100])

# Encode all questions
query_embeddings1 = model1.encode([qa["question"] for qa in qa_pairs[:100]])
query_embeddings2 = model2.encode([qa["question"] for qa in qa_pairs[:100]])

# Execute the evaluation
acc1 = evaluate(model1, query_embeddings1, context_embeddings1)
acc2 = evaluate(model2, query_embeddings2, context_embeddings2)

print(f"Model 1 accuracy: {acc1:.2%}")
print(f"Model 2 accuracy: {acc2:.2%}")

The printed result is:

Model 1 accuracy: 47.00%
Model 2 accuracy: 22.00%

This gives a direct comparison of how well the two embedding models handle English text.

There are many kinds of embedding models. Below are some mainstream categories and a selection guide.

1. Universal all-rounder

BGE-M3: developed by the Beijing Academy of Artificial Intelligence (BAAI), it supports multiple languages and hybrid retrieval (dense + sparse vectors), handles 8K-token contexts, and suits enterprise-level knowledge bases (a hybrid-encoding sketch follows this subsection).

NV-Embed-v2: based on Mistral-7B, it offers high retrieval accuracy (MTEB score 62.65) but requires more computing resources.
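
As a hedged sketch of the dense + sparse hybrid encoding mentioned for BGE-M3, the snippet below uses the FlagEmbedding package; the package, call signature, and output keys should be verified against its documentation before relying on them.

# Sketch: dense + sparse (lexical-weight) encoding with BGE-M3 via FlagEmbedding
# Assumes `pip install FlagEmbedding`
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
sentences = ["What should I do if I forget my password?", "User account is locked"]
output = model.encode(sentences, return_dense=True, return_sparse=True)

dense_vecs = output["dense_vecs"]            # one dense vector per sentence
lexical_weights = output["lexical_weights"]  # one sparse token-weight dict per sentence
print(dense_vecs.shape, len(lexical_weights))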

2. Vertical-domain specialization

Chinese scenarios: BGE-large-zh-v1.5 (contracts/policy documents), M3E-base (social media analysis).

Multimodal scenarios: BGE-VL jointly encodes OCR text and image features.

3. Lightweight deployment

nomic-embed-text: 768-dimensional vectors, roughly 3 times faster inference than OpenAI's embeddings, suitable for edge devices.

gte-qwen2-1.5b-instruct: 1.5B parameters, runs within 16 GB of GPU memory, suitable for prototyping by startup teams.


Selection decision tree (a toy code rendering follows the list):

    1. Mainly Chinese content → BGE series > M3E;

    2. Multilingual requirements → BGE-M3 > multilingual-e5;

    3. Limited budget → open-source models (such as Nomic Embed).
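
Rendered as a toy helper function (the model identifiers are only illustrative suggestions drawn from this article, not a definitive mapping):

# Toy rendering of the selection decision tree above
def pick_embedding_model(mainly_chinese: bool, multilingual: bool, limited_budget: bool) -> str:
    if mainly_chinese:
        return "BAAI/bge-large-zh-v1.5"        # BGE series preferred over M3E
    if multilingual:
        return "BAAI/bge-m3"                   # preferred over multilingual-e5
    if limited_budget:
        return "nomic-ai/nomic-embed-text-v1"  # open-source option
    return "BAAI/bge-m3"                       # reasonable general default

print(pick_embedding_model(mainly_chinese=False, multilingual=True, limited_budget=False))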


3. Hands-on Code Examples

1. Download the embedding model from the ModelScope community

# Model download
from modelscope import snapshot_download

model_dir = snapshot_download('BAAI/bge-m3', cache_dir=r"D:\Test\LLMTrain\testllm\llm")

2. Semantic similarity test

Installing Packages

pip install -U sentence-transformers

Logic code 

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # or "all-MiniLM-L6-v2"
sentences = ["This is a cat.", "This is a cat."]
embeddings = model.encode(sentences)

# Look at the similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # Result: 0.94xxx

Using the demo code from LlamaIndex:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import numpy as np

# Load the BGE embedding model from a local path
model_name = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3"
embed_model = HuggingFaceEmbedding(
    model_name=model_name,
    device="cpu",     # use "cuda" if a GPU is available, otherwise keep "cpu"
    normalize=True,   # normalize vectors so the dot product equals cosine similarity
)

# Embed the documents
documents = ["What should I do if I forget my password?", "User account is locked"]
doc_embeddings = [embed_model.get_text_embedding(doc) for doc in documents]

# Embed the query and calculate similarity
query = "Password reset process"
query_embedding = embed_model.get_text_embedding(query)

# Cosine similarity (because normalize=True, the dot product is the cosine similarity)
similarity = np.dot(query_embedding, doc_embeddings[0])
print(f"Similarity: {similarity:.4f}")  # Output example: 0.7942
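
Continuing with the variables from the block above, a tiny retrieval step could rank both documents against the query; this is an illustrative sketch rather than part of the original example.

# Sketch: rank the documents for the query using the vectors computed above
scores = [float(np.dot(query_embedding, d)) for d in doc_embeddings]
best = int(np.argmax(scores))
print(f"Best match: {documents[best]!r} (score {scores[best]:.4f})")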