What is an Embedding vector model? How should we choose one?

An embedding vector model is a bridge that lets machines understand data and process it intelligently.
Core content:
1. Definition of Embedding model and its role in data representation
2. Working principle and application scenarios of Embedding model
3. Introduction and comparison of different Embedding vectorization methods
For example, the model knows that "cat" and "dog" are both pets, so their number strings are very similar; but "cat" and "watermelon" are very different, so their number strings are far apart.
What is it used for? For example, you can let your phone understand what you say, or let Taobao guess what products you like.
The principle behind an Embedding model is like a "guess-the-word" game:
For example, when you are chatting with a friend and they say, "I bought a __ today, and it meows," you will immediately guess "cat." By reading huge numbers of such sentences, the Embedding model learns which words tend to appear together and turns them into similar strings of numbers.
Method example:
Word2Vec: like an elementary school student memorizing words, it learns that "apple" and "fruit" are related (a small sketch follows this list).
BERT (a context-aware pre-trained model): like a top student, it guesses a word from its surrounding context; for example, "bank" means different things in "save money at the bank" and "sit by the river bank".
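To make the Word2Vec idea concrete, here is a minimal sketch using the gensim library on a toy corpus. The corpus and hyperparameters are illustrative assumptions only; real training needs far more text, so the printed similarities carry no weight.

```python
# Minimal sketch: train a tiny Word2Vec model with gensim (gensim 4.x API)
from gensim.models import Word2Vec

sentences = [
    ["i", "bought", "a", "cat", "and", "it", "meows"],
    ["i", "bought", "a", "dog", "and", "it", "barks"],
    ["dogs", "and", "cats", "are", "pets"],
    ["i", "ate", "a", "watermelon", "for", "lunch"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> Skip-Gram

print(model.wv.similarity("cat", "dog"))         # expected to be higher after real training
print(model.wv.similarity("cat", "watermelon"))  # expected to be lower after real training
```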
Common vectorization methods include:
| Category | Representative methods |
| --- | --- |
| Word frequency / co-occurrence | TF-IDF: based on word-frequency statistics. LSA: singular value decomposition (SVD) of the word-document matrix. |
| Neural network | Word2Vec: Skip-Gram / CBOW (suited to general semantics). FastText: adds subword information to handle out-of-vocabulary words. GloVe: combines global word co-occurrence statistics with local context windows. |
| Context-sensitive | BERT: dynamically generates context-dependent vectors (e.g. "bank" differs in "river bank" and "bank deposit"). |
| Graph embedding | Node2Vec: maps graph nodes to vectors while preserving network-structure characteristics. |
| Non-textual domains | Matrix factorization (MF) in recommendation systems to generate user/item vectors. |
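For a quick feel of the word-frequency family in the table above, here is a minimal TF-IDF sketch with scikit-learn (the documents are made-up examples):

```python
# Minimal sketch: TF-IDF vectorization with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "coffee can refresh you",
    "tea also contains caffeine",
    "watermelon is a summer fruit",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
print(tfidf_matrix.shape)                      # (3, vocabulary size)
```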
When building a local knowledge base, the main practical choices are:

Model: open-source general-purpose models (e.g. text2vec, m3e-base) are suitable for local deployment; for specialized fields, use a domain fine-tuned model (e.g. Law-Embedding for law, BioBERT for medicine).

```python
# Pseudocode: generate a vector with text2vec
from text2vec import SentenceModel

model = SentenceModel("shibing624/text2vec-base-chinese")
text = "Caffeine can refresh you, but excessive intake can cause palpitations."
vector = model.encode(text)  # Output example: [0.3, -0.2, 0.8, ...]
```

Vector storage: lightweight options such as FAISS (small datasets, fast retrieval), or large-scale engines such as Milvus and Qdrant (millions of vectors). An indexing sketch follows below.
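A minimal sketch of turning a few documents into vectors and indexing them with FAISS, assuming the text2vec `model` loaded above and a made-up `documents` list:

```python
# Minimal sketch: index document vectors with FAISS
# Assumes `model` is the text2vec SentenceModel loaded above; `documents` is an illustrative corpus.
import faiss
import numpy as np

documents = [
    "Caffeine can refresh you, but excessive intake can cause palpitations.",
    "Green tea contains less caffeine than coffee.",
    "Watermelon is mostly water and contains no caffeine.",
]
doc_vectors = np.array(model.encode(documents), dtype="float32")

index = faiss.IndexFlatL2(doc_vectors.shape[1])  # exact L2 search; fine for small datasets
index.add(doc_vectors)
print(index.ntotal)  # number of stored vectors
```

For millions of vectors, an approximate index or a service such as Milvus or Qdrant (as noted above) is the usual choice.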
Accepting User Input
User input can be text, speech-to-text, or even images (which must first be converted to text using a multimodal model).
Sample input:
Text: "What are some refreshing drinks?"
Voice: "Help me find articles about coffee."
Input preprocessing
Error correction: fix typos (e.g. "提神秦料" → "提神饮料", "refreshing drink").
Condensing: extract the key information (e.g. "Can you tell me how to quickly wake up?" → "quick ways to wake up").
Generate user input vector
Use the same Embedding model to convert user input into a vector.
Code example:
user_query = "What are the refreshing drinks?" query_vector = model.encode(user_query) # Generate vectors, such as [0.25, -0.3, 0.75, ...]
Similarity calculation and matching
In the vector database, quickly find the local data vector closest to the user vector.
Algorithms:
Cosine similarity (most commonly used; measures directional similarity; see the sketch below this list).
Euclidean distance (measures the absolute distance between two vectors).
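A small numpy sketch of both measures on made-up vectors:

```python
# Cosine similarity and Euclidean distance between two illustrative vectors
import numpy as np

a = np.array([0.25, -0.30, 0.75])
b = np.array([0.30, -0.20, 0.80])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar
euclidean_distance = np.linalg.norm(a - b)                                  # smaller = more similar

print(cosine_similarity, euclidean_distance)
```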
Tool example (using FAISS):
```python
# Pseudocode: search FAISS for the most similar vectors
import numpy as np

query_matrix = np.array([query_vector], dtype="float32")  # FAISS expects a 2D float32 array
distances, indices = index.search(query_matrix, 5)        # find the 5 closest results
for idx in indices[0]:
    print("Matching content:", documents[idx])            # documents: the original text list
```
Return results
Based on the matched vector, the corresponding original text or summary is returned.
Optimization tips:
Multi-way recall: match several similar passages at once to improve coverage.
Reranking: use a finer-grained model (such as a cross-encoder) to re-rank the matched results; a sketch follows below.
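A hedged sketch of the reranking step with a public cross-encoder from the sentence-transformers library. The model name is a common English example and an assumption, not something prescribed by this article.

```python
# Hedged sketch: rerank retrieved passages with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example English reranker

query = "What are some refreshing drinks?"
candidates = [
    "Caffeine can refresh you, but excessive intake can cause palpitations.",
    "Watermelon is mostly water and contains no caffeine.",
]
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by score, highest first
reranked = [p for _, p in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
print(reranked[0])
```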
User input → Vectorization → Search vector library → Match the most similar local data → Return results
(The query is vectorized with the same model as the local data, so both live in the same vector space.)
Summary
Using an Embedding model in a local knowledge base boils down to two steps:
a. Convert local files into digital vectors (build a database).
b. Convert user questions into digital vectors (search).
By comparing the two in this numeric space, the system can capture semantics much as a human would, enabling accurate question answering and search.
In RAG (Retrieval-Augmented Generation) applications, it is crucial to choose a suitable Embedding model. The following factors need to be considered when choosing:
First, it is necessary to clarify what type of data the RAG system will process, whether it is text, image, or multimodal data. Different data types may require different Embedding models.
For example, for text data, you can refer to HuggingFace's MTEB (Massive Text Embedding Benchmark) ranking to choose a suitable model; for multimodal needs, such as image and text retrieval, you can choose CLIP or ViLBERT .
Secondly, you can choose the appropriate model based on whether the task is general or specific. If the task involves general knowledge, you can choose a general Embedding model; if the task involves a specific field (such as law, medicine, education, finance, etc.), you need to choose a model that is more suitable for that field.
If the system needs to support multiple languages, you can choose an Embedding model that supports multiple languages, such as BAAI/bge-M3, bce_embedding (Chinese and English), etc. If the knowledge base mainly contains Chinese data, you can choose a model such as iic/nlp_gte_sentence-embedding_chinese-base .
Check out benchmarking frameworks such as the MTEB leaderboard to evaluate the performance of different models. These leaderboards cover a variety of languages and task types, which can help you find the best performing model for a specific task. At the same time, consider the size of the model and resource constraints. Larger models may provide higher performance, but will also increase computational costs and memory requirements.
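As a hedged sketch (assuming the `mteb` Python package is installed; the model and task names are only example choices), evaluating a candidate model on one MTEB task might look like this:

```python
# Hedged sketch: evaluate a candidate embedding model on one MTEB task
from sentence_transformers import SentenceTransformer
from mteb import MTEB

model = SentenceTransformer("shibing624/text2vec-base-chinese")  # any candidate model
evaluation = MTEB(tasks=["Banking77Classification"])             # pick tasks close to your use case
results = evaluation.run(model, output_folder="results/text2vec-base-chinese")
```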
Recommend some excellent Embedding models:
Text data: refer to HuggingFace's MTEB rankings or the ModelScope (魔搭) community leaderboard in China.
Multi-language support: select models such as BAAI/bge-M3, bce_embedding (Chinese and English), etc.
Chinese data: Select a model such as iic/nlp_gte_sentence-embedding_chinese-base.
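As a hedged sketch, loading one of the multilingual models recommended above with sentence-transformers might look like this (assuming the model can be downloaded from the Hugging Face Hub):

```python
# Hedged sketch: encode Chinese and English queries with a multilingual model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(["提神饮料有哪些？", "What are some refreshing drinks?"])
print(vectors.shape)  # (2, embedding dimension)
```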
The Embedding model is a bridge between symbolic data and numerical computation; its choice should be driven by the actual task requirements and data characteristics. As pre-trained models continue to advance, dynamic and multimodal embeddings are becoming the trend.