What is an Embedding vector model? How should we choose one?

An embedding vector model is a bridge that lets machines understand data and process it intelligently.
Core content:
1. Definition of Embedding model and its role in data representation
2. Working principle and application scenarios of Embedding model
3. Introduction and comparison of different Embedding vectorization methods
For example, the model knows that "cat" and "dog" are both pets, so their number strings are very similar; but "cat" and "watermelon" are very different, so their number strings are far apart.
What is it used for? For example, you can let your phone understand what you say, or let Taobao guess what products you like.
The principle behind an Embedding model is like a "guess-the-word" game:
For example, when you are chatting with a friend and they say, "I bought a __ today, and it meows," you will immediately guess "cat." By reading huge numbers of such sentences, the Embedding model learns which words tend to appear together and turns them into similar strings of numbers.
Method example:
Word2Vec: like an elementary school student memorizing words, it learns that "apple" and "fruit" are related (a small sketch follows this list).
BERT (a context-aware pre-trained model): like a top student, it guesses a word from its surrounding context; for example, "bank" means different things in "save money at the bank" and "sit by the river bank".
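To make the Word2Vec idea concrete, here is a minimal sketch using the gensim library on a toy corpus. The corpus and hyperparameters are illustrative assumptions only; real training needs far more text, so the printed similarities carry no weight.

```python
# Minimal sketch: train a tiny Word2Vec model with gensim (gensim 4.x API)
from gensim.models import Word2Vec

sentences = [
    ["i", "bought", "a", "cat", "and", "it", "meows"],
    ["i", "bought", "a", "dog", "and", "it", "barks"],
    ["dogs", "and", "cats", "are", "pets"],
    ["i", "ate", "a", "watermelon", "for", "lunch"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> Skip-Gram

print(model.wv.similarity("cat", "dog"))         # expected to be higher after real training
print(model.wv.similarity("cat", "watermelon"))  # expected to be lower after real training
```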
Common vectorization methods include:
| Category | Representative methods |
| --- | --- |
| Word frequency / co-occurrence | TF-IDF: based on word-frequency statistics. LSA: singular value decomposition (SVD) of the word-document matrix. |
| Neural network | Word2Vec: Skip-Gram / CBOW (suited to general semantics). FastText: adds subword information to handle out-of-vocabulary words. GloVe: combines global word co-occurrence statistics with local context windows. |
| Context-sensitive | BERT: dynamically generates context-dependent vectors (e.g. "bank" differs in "river bank" and "bank deposit"). |
| Graph embedding | Node2Vec: maps graph nodes to vectors while preserving network-structure characteristics. |
| Non-textual domains | Matrix factorization (MF) in recommendation systems to generate user/item vectors. |
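For a quick feel of the word-frequency family in the table above, here is a minimal TF-IDF sketch with scikit-learn (the documents are made-up examples):

```python
# Minimal sketch: TF-IDF vectorization with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "coffee can refresh you",
    "tea also contains caffeine",
    "watermelon is a summer fruit",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
print(tfidf_matrix.shape)                      # (3, vocabulary size)
```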
When building a local knowledge base, the main practical choices are:

Model: open-source general-purpose models (e.g. text2vec, m3e-base) are suitable for local deployment; for specialized fields, use a domain fine-tuned model (e.g. Law-Embedding for law, BioBERT for medicine).

```python
# Pseudocode: generate a vector with text2vec
from text2vec import SentenceModel

model = SentenceModel("shibing624/text2vec-base-chinese")
text = "Caffeine can refresh you, but excessive intake can cause palpitations."
vector = model.encode(text)  # Output example: [0.3, -0.2, 0.8, ...]
```

Vector storage: lightweight options such as FAISS (small datasets, fast retrieval), or large-scale engines such as Milvus and Qdrant (millions of vectors). An indexing sketch follows below.
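A minimal sketch of turning a few documents into vectors and indexing them with FAISS, assuming the text2vec `model` loaded above and a made-up `documents` list:

```python
# Minimal sketch: index document vectors with FAISS
# Assumes `model` is the text2vec SentenceModel loaded above; `documents` is an illustrative corpus.
import faiss
import numpy as np

documents = [
    "Caffeine can refresh you, but excessive intake can cause palpitations.",
    "Green tea contains less caffeine than coffee.",
    "Watermelon is mostly water and contains no caffeine.",
]
doc_vectors = np.array(model.encode(documents), dtype="float32")

index = faiss.IndexFlatL2(doc_vectors.shape[1])  # exact L2 search; fine for small datasets
index.add(doc_vectors)
print(index.ntotal)  # number of stored vectors
```

For millions of vectors, an approximate index or a service such as Milvus or Qdrant (as noted above) is the usual choice.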
Accepting User Input
User input can be text, speech-to-text, or even images (which must first be converted to text using a multimodal model).
Sample input:
Text: "What are some refreshing drinks?"
Voice: "Help me find articles about coffee."
Input preprocessing
Error correction: fix typos (e.g. "提神秦料" → "提神饮料", "refreshing drink").
Condensing: extract the key information (e.g. "Can you tell me how to quickly wake up?" → "quick ways to wake up").
Generate user input vector
Use the same Embedding model to convert user input into a vector.
Code example:
user_query = "What are the refreshing drinks?" query_vector = model.encode(user_query) # Generate vectors, such as [0.25, -0.3, 0.75, ...]
Similarity calculation and matching
In the vector database, quickly find the local data vector closest to the user vector.
Algorithms:
Cosine similarity (most commonly used; measures directional similarity; see the sketch below this list).
Euclidean distance (measures the absolute distance between two vectors).
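A small numpy sketch of both measures on made-up vectors:

```python
# Cosine similarity and Euclidean distance between two illustrative vectors
import numpy as np

a = np.array([0.25, -0.30, 0.75])
b = np.array([0.30, -0.20, 0.80])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar
euclidean_distance = np.linalg.norm(a - b)                                  # smaller = more similar

print(cosine_similarity, euclidean_distance)
```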
Tool example (using FAISS):
```python
# Pseudocode: search FAISS for the most similar vectors
import numpy as np

query_matrix = np.array([query_vector], dtype="float32")  # FAISS expects a 2D float32 array
distances, indices = index.search(query_matrix, 5)        # find the 5 closest results
for idx in indices[0]:
    print("Matching content:", documents[idx])            # documents: the original text list
```
Return results
Based on the matched vector, the corresponding original text or summary is returned.
Optimization tips:
Multi-way recall: match several similar passages at once to improve coverage.
Reranking: use a finer-grained model (such as a cross-encoder) to re-rank the matched results; a sketch follows below.
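A hedged sketch of the reranking step with a public cross-encoder from the sentence-transformers library. The model name is a common English example and an assumption, not something prescribed by this article.

```python
# Hedged sketch: rerank retrieved passages with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example English reranker

query = "What are some refreshing drinks?"
candidates = [
    "Caffeine can refresh you, but excessive intake can cause palpitations.",
    "Watermelon is mostly water and contains no caffeine.",
]
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by score, highest first
reranked = [p for _, p in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
print(reranked[0])
```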
User input → Vectorization → Search vector library → Match the most similar local data → Return results
(The query is vectorized with the same model as the local data, so both live in the same vector space.)
Summary
Using an Embedding model in a local knowledge base boils down to two steps:
a. Convert local files into digital vectors (build a database).
b. Convert user questions into digital vectors (search).
By comparing the two in this numeric space, the system can capture semantics much as a human would, enabling accurate question answering and search.
In RAG (Retrieval-Augmented Generation) applications, it is crucial to choose a suitable Embedding model. The following factors need to be considered when choosing:
First, it is necessary to clarify what type of data the RAG system will process, whether it is text, image, or multimodal data. Different data types may require different Embedding models.
For example, for text data, you can refer to HuggingFace's MTEB (Massive Text Embedding Benchmark) ranking to choose a suitable model; for multimodal needs, such as image and text retrieval, you can choose CLIP or ViLBERT .
Secondly, you can choose the appropriate model based on whether the task is general or specific. If the task involves general knowledge, you can choose a general Embedding model; if the task involves a specific field (such as law, medicine, education, finance, etc.), you need to choose a model that is more suitable for that field.
If the system needs to support multiple languages, you can choose an Embedding model that supports multiple languages, such as BAAI/bge-M3, bce_embedding (Chinese and English), etc. If the knowledge base mainly contains Chinese data, you can choose a model such as iic/nlp_gte_sentence-embedding_chinese-base .
Check out benchmarking frameworks such as the MTEB leaderboard to evaluate the performance of different models. These leaderboards cover a variety of languages and task types, which can help you find the best performing model for a specific task. At the same time, consider the size of the model and resource constraints. Larger models may provide higher performance, but will also increase computational costs and memory requirements.
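As a hedged sketch (assuming the `mteb` Python package is installed; the model and task names are only example choices), evaluating a candidate model on one MTEB task might look like this:

```python
# Hedged sketch: evaluate a candidate embedding model on one MTEB task
from sentence_transformers import SentenceTransformer
from mteb import MTEB

model = SentenceTransformer("shibing624/text2vec-base-chinese")  # any candidate model
evaluation = MTEB(tasks=["Banking77Classification"])             # pick tasks close to your use case
results = evaluation.run(model, output_folder="results/text2vec-base-chinese")
```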
Recommend some excellent Embedding models:
Text data: refer to HuggingFace's MTEB rankings or the ModelScope (魔搭) community leaderboard in China.
Multi-language support: select models such as BAAI/bge-M3, bce_embedding (Chinese and English), etc.
Chinese data: Select a model such as iic/nlp_gte_sentence-embedding_chinese-base.
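As a hedged sketch, loading one of the multilingual models recommended above with sentence-transformers might look like this (assuming the model can be downloaded from the Hugging Face Hub):

```python
# Hedged sketch: encode Chinese and English queries with a multilingual model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(["提神饮料有哪些？", "What are some refreshing drinks?"])
print(vectors.shape)  # (2, embedding dimension)
```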
The Embedding model is a bridge between symbolic data and numerical computation; its choice should be driven by the actual task requirements and data characteristics. As pre-trained models continue to advance, dynamic and multimodal embeddings are becoming the trend.