RAG Advanced: Embedding Model Principles and Selection

This installment of the advanced RAG series starts with embedding models.
Core content:
1. Basic concepts and core principles of embedding models
2. Mainstream categories of embedding models and a selection guide
3. Hands-on case: evaluation on the SQuAD dataset with a code walkthrough
1. Basic Concepts and Core Principles
1. The Nature of Embedding Models
An embedding model maps discrete data (such as text or images) into a continuous vector space. Through high-dimensional vector representations (e.g., 768 or 3072 dimensions), the model captures the semantic information of the data, so that semantically similar texts end up closer together in the vector space. For example, "forgot password" and "account locked" are encoded as similar vectors, which enables semantic retrieval rather than mere keyword matching.
2. Core Roles
Semantic encoding: convert text, images, etc. into vectors while retaining contextual information (such as BERT's CLS token or mean pooling).
Similarity calculation: measure the relatedness between vectors via cosine similarity, Euclidean distance, etc., supporting applications such as retrieval-augmented generation (RAG) and recommendation systems (see the sketch after this list).
Information dimensionality reduction: compress complex data into dense vectors, improving storage and computation efficiency.
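As a minimal sketch of the two similarity measures mentioned above (plain NumPy; the vectors here are made up for illustration, real models output hundreds to thousands of dimensions):

```python
import numpy as np

# Two hypothetical embedding vectors
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.25, 0.6, 0.15])

# Cosine similarity: dot product of the L2-normalized vectors, range [-1, 1]
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance, smaller means more similar
euc_dist = np.linalg.norm(a - b)

print(f"cosine similarity: {cos_sim:.4f}, euclidean distance: {euc_dist:.4f}")
```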
3. Key Technical Principles
Context dependence: modern models (such as BGE-M3) adjust vectors dynamically with context, capturing the different meanings of polysemous words.
Training methods: contrastive learning (such as Word2Vec's Skip-gram/CBOW) and pre-training + fine-tuning (such as BERT); a toy contrastive-loss sketch follows this list.
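To make the contrastive-learning idea concrete, here is a minimal in-batch InfoNCE-style loss in PyTorch. This is a generic sketch, not the training code of any particular model; the batch size, dimensionality, and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the document
    at the same index; all other documents in the batch are negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch: 4 query/document pairs with 8-dimensional embeddings
loss = info_nce_loss(torch.randn(4, 8), torch.randn(4, 8))
print(loss.item())
```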
2. Mainstream Model Classification and Selection Guide
Embedding models convert text into numerical vectors that capture semantic information, enabling computers to understand and compare the "meaning" of content.
Considerations when selecting an embedding model (each factor is discussed in the categories below):

| Factor | Description |
|---|---|
| Language support | Monolingual (Chinese or English) vs. multilingual (e.g., BGE-M3) |
| Retrieval accuracy | Benchmark results such as MTEB scores (e.g., NV-Embed-v2 at 62.65) |
| Context length | Maximum input length the model handles (e.g., 8K for BGE-M3) |
| Resource requirements | Parameter count and GPU memory needed (e.g., gte-Qwen2-1.5B-instruct fits in 16 GB) |
| Budget | Open-source self-hosted models vs. paid embedding APIs |
For evaluation we usually use SQuAD (validation set), a standard benchmark for question-answering systems. SQuAD is an English reading-comprehension dataset released by Stanford University, used to train and evaluate models that answer questions based on a given passage. Each sample in the dataset includes:
A paragraph (context)
A question
The correct answer, which is a span taken from the paragraph
Download address:
https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
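For orientation, the downloaded file is nested roughly as follows (abridged; field names follow the official SQuAD v2.0 schema):

```json
{
  "data": [
    {
      "title": "...",
      "paragraphs": [
        {
          "context": "The passage text ...",
          "qas": [
            {
              "question": "...?",
              "is_impossible": false,
              "answers": [{"text": "answer span", "answer_start": 42}]
            }
          ]
        }
      ]
    }
  ]
}
```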
We evaluate and compare two embedding models. The metric is top-1 retrieval accuracy: for each question we retrieve the most similar paragraph and count a hit when that paragraph contains the gold answer. The code is as follows:
```python
# Comparing the retrieval quality of two embedding models
from sentence_transformers import SentenceTransformer, util
import json
import numpy as np

# Load SQuAD data (the dev-v2.0.json file downloaded above)
with open(r"D:\Test\LLMTrain\day18\data\squad_dev.json", encoding="utf-8") as f:
    squad_data = json.load(f)["data"]

# Extract answerable question-answer pairs
qa_pairs = []
for article in squad_data:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            if not qa["is_impossible"]:
                qa_pairs.append({
                    "question": qa["question"],
                    "answer": qa["answers"][0]["text"],
                    "context": para["context"],
                })

# Initialize two local models
model1 = SentenceTransformer(r"D:\Test\LLMTrain\testllm\llm\sentence-transformers\paraphrase-multilingual-MiniLM-L12-v2")  # Model 1
model2 = SentenceTransformer(r"D:\Test\LLMTrain\testllm\llm\sungw111\text2vec-base-chinese-sentence")  # Model 2

# Encode all contexts (these act as the vector library)
contexts = [item["context"] for item in qa_pairs]
context_embeddings1 = model1.encode(contexts)  # vector library of model 1
context_embeddings2 = model2.encode(contexts)  # vector library of model 2

# Evaluation: top-1 retrieval accuracy
def evaluate(query_embeddings, context_embeddings):
    correct = 0
    n = len(query_embeddings)
    for idx, qa in enumerate(qa_pairs[:n]):
        # Find the most similar context for this question
        sim_scores = util.cos_sim(query_embeddings[idx], context_embeddings)
        best_match_idx = int(np.argmax(sim_scores))
        # Count a hit if the gold answer appears in the retrieved paragraph
        if qa["answer"] in contexts[best_match_idx]:
            correct += 1
    return correct / n

# Encode the first 100 questions for the test
query_embeddings1 = model1.encode([qa["question"] for qa in qa_pairs[:100]])
query_embeddings2 = model2.encode([qa["question"] for qa in qa_pairs[:100]])

# Run the evaluation
acc1 = evaluate(query_embeddings1, context_embeddings1)
acc2 = evaluate(query_embeddings2, context_embeddings2)
print(f"Model 1 accuracy: {acc1:.2%}")
print(f"Model 2 accuracy: {acc2:.2%}")
```
The printed result is:

```
Model 1 accuracy: 47.00%
Model 2 accuracy: 22.00%
```
This gives a direct comparison of the two models' understanding of English: Model 1 (a multilingual model) clearly outperforms Model 2 (a Chinese-oriented model) on this English dataset.
There are many embedding models available. The following are the mainstream categories with a selection guide.
1. General-purpose all-rounders
BGE-M3: developed by the Beijing Academy of Artificial Intelligence (BAAI); supports multiple languages and hybrid retrieval (dense + sparse vectors), handles 8K contexts, and suits enterprise-level knowledge bases.
NV-Embed-v2: based on Mistral-7B; high retrieval accuracy (MTEB score 62.65), but it demands more computing resources.
2. Domain specialists
Chinese scenarios: BGE-large-zh-v1.5 (contracts/policy documents), M3E-base (social-media analysis).
Multimodal scenarios: BGE-VL jointly encodes OCR text and image features.
3. Lightweight deployment
nomic-embed-text: 768-dimensional vectors, roughly 3 times faster inference than OpenAI's embedding API; suitable for edge devices.
gte-Qwen2-1.5B-instruct: 1.5B parameters, runs in 16 GB of GPU memory; suitable for start-up teams' prototyping.
Selection decision tree:
Mainly Chinese → BGE series > M3E;
Multilingual requirements → BGE-M3 > multilingual-e5;
Limited budget → open-source models (such as Nomic Embed).
3. Hands-on Code Examples
1. Download the embedding model from the ModelScope community
```python
# Model download
from modelscope import snapshot_download

model_dir = snapshot_download('BAAI/bge-m3', cache_dir=r"D:\Test\LLMTrain\testllm\llm")
```
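snapshot_download returns the local path of the downloaded snapshot, so it can be passed straight to SentenceTransformer in the next step:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(model_dir)  # load from the local ModelScope cache
```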
2. Semantic similarity test
Install the package:

```bash
pip install -U sentence-transformers
```
Logic code:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("BAAI/bge-m3")  # or "all-MiniLM-L6-v2"

# Two semantically close (but not identical) sentences
sentences = ["This is a cat.", "That is a cat."]
embeddings = model.encode(sentences)

# Check the similarity
print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # prints a high score, e.g. ~0.9
```
Using the demo code from LlamaIndex:
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import numpy as np

# Load the BGE embedding model (bge-m3)
model_name = r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3"
embed_model = HuggingFaceEmbedding(
    model_name=model_name,
    device="cpu",     # use "cuda" if a GPU is available
    normalize=True,   # normalize vectors to ease cosine-similarity computation
)

# Embed the documents
documents = ["What should I do if I forget my password?", "User account is locked"]
doc_embeddings = [embed_model.get_text_embedding(doc) for doc in documents]

# Embed the query and compute similarity
query = "Password reset process"
query_embedding = embed_model.get_text_embedding(query)

# Cosine similarity (because normalize=True, the dot product is the cosine similarity)
similarity = np.dot(query_embedding, doc_embeddings[0])
print(f"Similarity: {similarity:.4f}")  # example output: 0.7942
```
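As a small extension of the snippet above (reusing its documents, doc_embeddings, and query_embedding), ranking every document by its similarity to the query is exactly the retrieval step of a minimal RAG pipeline:

```python
# Rank all documents by cosine similarity to the query
scores = [float(np.dot(query_embedding, doc_emb)) for doc_emb in doc_embeddings]
for rank, idx in enumerate(np.argsort(scores)[::-1], start=1):
    print(f"{rank}. {documents[idx]}  (similarity: {scores[idx]:.4f})")
```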