From BGE to CLIP, from text to multimodality: the ultimate guide to embedding model selection

An in-depth guide to selecting embedding models across multiple domains.
Core content:
1. The value of embedding models in semantic search, recommendation systems, and other scenarios
2. An evaluation framework for choosing the right model based on task type and business needs
3. How data characteristics affect model selection, and matching strategies
By converting raw input into fixed-size, high-dimensional vectors that capture semantic information, embedding models play a vital role in building RAG systems, recommendation systems, and even training models for autonomous driving.
Technology giants such as OpenAI, Meta, and Google have continued to increase their investment in embedding model research and development in recent years. Take OpenAI as an example: its text-embedding-3-small generates 1536-dimensional vectors while maintaining strong semantic expressiveness, with lower latency and a smaller model size, making it suitable for performance-sensitive, large-scale semantic retrieval. Meta launched DLRM (Deep Learning Recommendation Model) as early as 2019, building vector representations of users and items on top of embedding layers for advertising click-through-rate prediction. DLRM remains a core component of Meta's recommendation system, serving hundreds of billions of recommendation requests per day.
But how do we choose an embedding model? This article will provide a practical evaluation framework so that we can choose the most appropriate embedding model according to our needs.
01
Clarify the task and business requirements
Before choosing a model, you first need to clarify your core goals:
Task type: Are you building semantic search, a recommendation system, a classification pipeline, or another type of application? Different tasks place different requirements on how embeddings represent information. For example:
Semantic search: models such as Sentence-BERT are needed to capture semantic nuances between queries and documents, so that similar concepts sit close together in the vector space (see the sketch after this list).
Classification tasks: embeddings need to reflect category structure, with inputs from the same category closer together so that a downstream classifier can separate them. Commonly used models include DistilBERT and RoBERTa.
Recommendation systems: embeddings need to capture associations between users and items; models trained on implicit feedback, such as neural collaborative filtering (NCF), can be used.
ROI assessment: weigh performance against cost in your business context. For critical tasks such as medical diagnosis, where higher accuracy can mean the difference between life and death, a more expensive but more accurate model is acceptable. For high-concurrency, cost-sensitive applications, calculate carefully whether the performance gain justifies the cost.
Other constraints:
Multilingual support: general-purpose models tend to perform poorly on non-English content; dedicated multilingual models may be required.
Domain-specific support: general models often miss specialized terminology, such as "stat" in medicine or "consideration" in law; specialized models such as BioBERT and LegalBERT should be considered.
Hardware/latency requirements: model size and inference speed directly affect deployment feasibility.
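To make the semantic-search requirement above concrete, here is a minimal similarity sketch using the sentence-transformers library. The model name and example sentences are illustrative assumptions, not recommendations from this article:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Assumed model: a small general-purpose Sentence-BERT checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do I reset my password"
docs = [
    "Steps to recover your account credentials",
    "Quarterly revenue report for 2023",
]

# Encode the query and documents into dense vectors, then rank by cosine similarity.
query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]

for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```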
02
Evaluating data characteristics
The characteristics of your data will directly affect the model you choose. Consider the following points:
Data modality: text, image, audio, or multimodal? Choose a model that matches the data type:
Text: BERT, Sentence-BERT;
Image: CNN or Vision Transformer;
Multimodality: CLIP, MagicLens;
Audio: CLAP, PANNs, etc.
Domain specificity: do you need a specialized model? General-purpose models such as OpenAI's embeddings perform well on everyday topics but may miss nuances in professional domains such as medicine and law; consider industry-specific models such as BioBERT.
Embedding type selection:
Sparse embeddings (such as BM25-style lexical representations) excel at keyword matching;
Dense embeddings (such as BERT) excel at semantic understanding;
A hybrid approach is common in practice: sparse embeddings for exact matching plus dense embeddings for semantic recall, as sketched below.
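As a rough illustration of the hybrid idea, the sketch below blends BM25 lexical scores with dense cosine similarity via a weighted sum. The libraries, model name, and weighting factor are assumptions for illustration only:

```python
# pip install rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "Golang has a garbage collector built into the runtime",
    "C++ requires manual memory management",
    "Python is widely used for web backends and data science",
]
query = "automatic memory management"

# Sparse side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity of sentence embeddings (assumed model).
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(docs))[0].numpy()

# Normalize each score list to [0, 1] and blend; alpha is a tunable assumption.
def normalize(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5
hybrid = alpha * normalize(sparse_scores) + (1 - alpha) * normalize(dense_scores)
print(docs[int(hybrid.argmax())])
```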
03
Research available models
After understanding the task and data, start investigating candidate models:
Popularity: it is safer to choose a model with an active community and widespread use; it will be easier to troubleshoot, updated quickly, and well documented.
Text: OpenAI embeddings, Sentence-BERT, E5/BGE;
Image: ViT, ResNet; for text-image alignment, CLIP or SigLIP;
Audio: PANNs, CLAP, etc.
Copyright and license:
Open-source models (MIT, Apache 2.0) suit self-hosted deployment, offering high flexibility but requiring operations capabilities;
Third-party API models are simple to adopt but bring ongoing costs plus data privacy and compliance concerns;
In industries such as finance and healthcare in particular, self-hosted deployment may be the only option.
04
Evaluating candidate models
After initial screening, model quality needs to be tested on real data:
Quality assessment:
For semantic retrieval and RAG applications, focus on faithfulness, relevance, and contextual precision and recall.
The evaluation process can be standardized with tools such as Ragas, DeepEval, Phoenix, and TruLens-Eval (a small Ragas sketch follows).
Dataset selection also matters: test sets can be drawn from real cases, synthesized with an LLM, or built with tools such as Ragas or FiddleCube.
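As one possible way to run such an evaluation, here is a minimal sketch using Ragas. The exact API and column names vary between Ragas versions (this follows the pre-0.2 interface), the sample data is purely illustrative, and an OpenAI API key is required by default:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Tiny illustrative sample; real evaluations need far more examples.
data = {
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [["Our warranty covers manufacturing defects for a period of two years."]],
    "ground_truth": ["Manufacturing defects are covered for two years."],
}

results = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```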
Benchmarks:
You can refer to public benchmarks such as MTEB (for semantic retrieval).
Note that rankings vary greatly across scenarios, and strong results on general benchmarks do not guarantee strong results in your environment.
Test on your own samples to guard against models that overfit the benchmark but perform poorly on real data.
Load testing:
When you deploy the model yourself, simulate real concurrent requests and measure GPU utilization, memory usage, throughput, and latency (see the load-test sketch below).
Some models perform well in single-request tests but consume too many resources under high load, blocking their launch.
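A minimal load-test sketch is shown below; the endpoint URL, payload schema, and concurrency level are hypothetical placeholders for whatever serving stack you actually deploy:

```python
# A rough latency/throughput probe using a thread pool; not a substitute for
# dedicated tools such as Locust or k6.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

EMBED_URL = "http://localhost:8000/embed"  # hypothetical self-hosted endpoint
PAYLOAD = {"texts": ["sample sentence for load testing"] * 16}  # hypothetical schema

def one_request(_):
    start = time.perf_counter()
    requests.post(EMBED_URL, json=PAYLOAD, timeout=30)
    return time.perf_counter() - start

n_requests, concurrency = 200, 16
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    latencies = sorted(pool.map(one_request, range(n_requests)))
elapsed = time.perf_counter() - start

print(f"throughput: {n_requests / elapsed:.1f} req/s")
print(f"p95 latency: {latencies[int(0.95 * len(latencies)) - 1] * 1000:.0f} ms")
```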
Generally speaking, the commonly referenced benchmarks and reference models for each data type are as follows:
(1) Text data: MTEB rankings
HuggingFace's MTEB leaderboard is a one-stop list of text embedding models where we can see each model's average performance.
You can sort the "Retrieval Average" column in descending order, since that best matches the vector search task, then look for the highest-ranked model with the smallest memory footprint.
The embedding dimension is the length of the output vector, i.e., the y in f(x) = y.
The maximum number of tokens is the length of the input text chunk you can feed into the model, i.e., the x in f(x) = y.
In addition to sorting by Retrieval task, you can also filter by the following criteria:
Language: the leaderboard supports filtering by language such as French, English, Chinese, or Polish (for example: task=retrieval, language=Chinese).
Domain: for example, legal-domain text (e.g., restricting retrieval tasks to the legal domain).
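To sanity-check a shortlisted model beyond the leaderboard numbers, you can run individual MTEB tasks yourself. The sketch below uses the mteb package with an assumed model and a single small retrieval task; the API may differ slightly across mteb versions:

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Assumed candidate model and a single small retrieval task for a quick check.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```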
(2) Image data: ResNet50
Sometimes we want to search for images similar to an input image; for example, to find more pictures of Scottish Fold cats, you can upload one photo and ask the search engine for similar images.
ResNet50 is a popular CNN model originally trained by Microsoft in 2015 using ImageNet data.
Similarly, for video search, ResNet50 can embed individual video frames; a similarity search over those frame embeddings then returns the most similar video as the best match.
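A minimal sketch of extracting ResNet50 image embeddings with torchvision is shown below; the image file path is a placeholder, and the pooled 2048-dimensional feature vector is what you would index in a vector database:

```python
# pip install torch torchvision pillow
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
preprocess = weights.transforms()

# Drop the final classification layer so the model outputs pooled features.
backbone = torch.nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()

image = Image.open("cat.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    features = backbone(preprocess(image).unsqueeze(0))  # shape: (1, 2048, 1, 1)

embedding = features.flatten(1)[0]  # 2048-dimensional embedding vector
print(embedding.shape)
```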
(3) Audio data: PANNs
Similar to image search, you can also search for similar audio based on the input audio clip.
PANNs (Pre-trained Audio Neural Networks) are commonly used for audio embeddings because they are pre-trained on large-scale audio datasets and perform well on tasks such as audio classification and tagging.
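A rough sketch of extracting an audio embedding with the panns_inference package follows; the audio path is hypothetical, and the exact function signatures may differ between package versions:

```python
# pip install panns-inference librosa
import librosa
from panns_inference import AudioTagging

# PANNs checkpoints expect 32 kHz mono audio; the file path is a placeholder.
audio, _ = librosa.load("clip.wav", sr=32000, mono=True)
audio = audio[None, :]  # shape: (batch_size, num_samples)

tagger = AudioTagging(checkpoint_path=None, device="cpu")  # downloads the default checkpoint
clipwise_output, embedding = tagger.inference(audio)
print(embedding.shape)  # one embedding vector per clip
```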
(4) Multimodal image and text data: SigLIP
In recent years, a number of embedding models have emerged that are trained on a mix of various unstructured data (text, images, audio, or video). These models can simultaneously capture the semantics of multiple types of unstructured data in the same vector space.
Multimodal embedding models support searching images with text, generating text descriptions for images, or searching images by image.
CLIP, released by OpenAI in 2021, is the standard baseline embedding model, but it is not the easiest to use because it typically requires task-specific fine-tuning. Google later released SigLIP, a CLIP-style model trained with a sigmoid loss, which achieves good performance in zero-shot settings.
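Here is a sketch of zero-shot image-text matching with SigLIP via Hugging Face transformers; the checkpoint name and image path are assumptions, and SigLIP expects `padding="max_length"` when tokenizing text:

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-base-patch16-224"  # assumed public checkpoint
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

image = Image.open("cat.jpg").convert("RGB")  # hypothetical input image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP uses a sigmoid (not a softmax) over the image-text logits.
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(texts, probs[0].tolist())))
```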
(5) Multimodal text, audio, and video data
Multimodal text-audio RAG systems mostly rely on multimodal generative LLMs. Such applications first transcribe audio to text, producing audio-text pairs, and then convert the text into embedding vectors. Retrieval then works over the text as usual, and in the final step the retrieved text is mapped back to its audio.
OpenAI's Whisper can transcribe speech into text, and OpenAI's text-to-speech (TTS) models can convert text back into audio.
Multimodal text-video RAG systems use a similar approach: first map the video to text, convert the text to embedding vectors, search over the text, and return the corresponding video as the result.
OpenAI's Sora can turn text into video: similar to DALL-E, you provide a text prompt and the model generates the video. Sora can also generate videos from static images or other videos.
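For the transcription and text-to-speech steps, a minimal sketch with the OpenAI Python SDK might look like the following; the file names are placeholders and an `OPENAI_API_KEY` environment variable is assumed:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech -> text with Whisper, so the transcript can be embedded and indexed.
with open("meeting.mp3", "rb") as audio_file:  # hypothetical audio file
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(transcript.text)

# Text -> speech, for mapping retrieved text back to audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=transcript.text)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)
```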
05
Integration and deployment planning
After selecting a model, plan the integration strategy:
Weight selection: using pre-trained weights directly is the quickest start, but domain customization requires investing in fine-tuning. Fine-tuning can improve results, but its return on investment needs to be evaluated.
Deployment method:
Self-hosting: strong control, lower cost at scale, and good data privacy, but requires operations capabilities;
Cloud API: fast to deploy with no operations burden, but subject to network latency and accumulating costs.
System integration design:
This includes API design, caching strategy, and batch processing (a simple caching sketch follows this list);
Choose a suitable vector database, such as Milvus or Faiss, to store and retrieve the embeddings.
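As a tiny illustration of the caching idea, the sketch below memoizes embeddings by a hash of the input text so repeated inputs skip the model call; the `embed` function is a stand-in for whatever model or API you actually use:

```python
import hashlib

# In-memory cache keyed by a content hash of the input text (illustrative only;
# a production system might use Redis or a database instead).
_cache = {}

def embed(text):
    """Stand-in for the real embedding call (local model or remote API)."""
    raise NotImplementedError

def cached_embed(text):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)  # only hit the model for unseen texts
    return _cache[key]
```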
06
End-to-end testing
Before going to production, be sure to run a closed-loop test:
Performance verification:
Use actual business data to verify whether it meets expectations;
Check retrieval-related metrics (MRR, MAP, NDCG), accuracy metrics (Precision, Recall, F1), and operational efficiency (throughput, P95/P99 latency).
Robustness testing:
Simulate different input conditions to ensure the model handles edge cases and complex data stably.
If necessary, we can evaluate embedding models on our own dataset. The following walks through an example of this evaluation process.
The dataset is prepared as a pandas DataFrame `df` with a `language` column and a `description` column, one row per programming language.
Next, we use `pymilvus[model]` to generate vector embeddings for this dataset. For details on using `pymilvus[model]`, see https://milvus.io/blog/introducing-pymilvus-integrations-with-embedding-models.md
import os

from pymilvus import model

def gen_embedding(model_name):
    # Build an OpenAI embedding function via pymilvus[model]; the API key comes from the environment.
    openai_ef = model.dense.OpenAIEmbeddingFunction(
        model_name=model_name,
        api_key=os.environ["OPENAI_API_KEY"]
    )
    # Embed every description in the DataFrame.
    docs_embeddings = openai_ef.encode_documents(df['description'].tolist())
    return docs_embeddings, openai_ef
Then, store the generated embeddings in a Milvus collection.
def save_embedding(docs_embeddings, collection_name, dim):
    # One record per row: id, embedding vector, and the language name stored as text.
    data = [
        {"id": i, "vector": docs_embeddings[i].data, "text": row.language}
        for i, row in df.iterrows()
    ]
    # milvus_client is a pymilvus MilvusClient; recreate the collection so each run starts clean.
    if milvus_client.has_collection(collection_name=collection_name):
        milvus_client.drop_collection(collection_name=collection_name)
    milvus_client.create_collection(collection_name=collection_name, dimension=dim)
    res = milvus_client.insert(collection_name=collection_name, data=data)
Query
We define a query function to make retrieving similar vectors easier.
def query_results(query, collection_name, openai_ef):
    # query is a list of query strings, e.g. ["auto garbage collection"].
    query_embeddings = openai_ef.encode_queries(query)
    res = milvus_client.search(
        collection_name=collection_name,
        data=query_embeddings,
        limit=4,
        output_fields=["text"],
    )
    # Map each returned language name to its distance score.
    result = {}
    for items in res:
        for item in items:
            result[item.get("entity").get("text")] = item.get('distance')
    return result
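Putting the three functions together might look like the sketch below. The collection names are made up here, and the dimensions (1536 for text-embedding-3-small, 3072 for text-embedding-3-large) follow OpenAI's documented defaults:

```python
# Hypothetical end-to-end run comparing the two OpenAI models.
for model_name, dim in [("text-embedding-3-small", 1536), ("text-embedding-3-large", 3072)]:
    collection_name = model_name.replace("-", "_")  # e.g. "text_embedding_3_small"
    docs_embeddings, openai_ef = gen_embedding(model_name)
    save_embedding(docs_embeddings, collection_name, dim)
    print(model_name, query_results(["auto garbage collection"], collection_name, openai_ef))
```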
Evaluating Embedding Model Performance
We use two OpenAI embedding models, text-embedding-3-small and text-embedding-3-large, and compare them on the following two queries. There are many evaluation metrics, such as precision, recall, MRR, and MAP; here, we use precision and recall.
Precision evaluates the proportion of truly relevant content in the search results, that is, how many of the returned results are relevant to the search query.
Precision = TP / (TP + FP)
Here, true positives (TP) are the search results that are truly relevant to the query, while false positives (FP) are the irrelevant results returned.
Recall evaluates the amount of relevant content successfully retrieved from the entire dataset.
Recall = TP / (TP + FN)
Here, false negatives (FN) are the relevant items that are missing from the final result set.
For a more detailed explanation of these two concepts, see https://zilliz.com/learn/information-retrieval-metrics
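For the small comparison below, precision@k and recall@k can be computed in a few lines. The helper below is illustrative; it takes the ranked result list and the set of relevant items:

```python
def precision_recall_at_k(ranked_results, relevant, k=4):
    """Compute precision@k and recall@k for a single query."""
    top_k = ranked_results[:k]
    tp = sum(1 for item in top_k if item in relevant)
    return tp / k, tp / len(relevant)

# Example with the Query 1 results from text-embedding-3-small:
print(precision_recall_at_k(
    ["Rust", "C/C++", "Golang", "Java"],
    {"Java", "Python", "JavaScript", "Golang"},
))  # -> (0.5, 0.5)
```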
Query 1: auto garbage collection
Relevant: Java, Python, JavaScript, Golang
| Rank | text-embedding-3-small | text-embedding-3-large |
|------|------------------------|------------------------|
| 1 | ❎ Rust | ❎ Rust |
| 2 | ❎ C/C++ | ❎ C/C++ |
| 3 | ✅ Golang | ✅ Java |
| 4 | ✅ Java | ✅ Golang |
| Precision | 0.50 | 0.50 |
| Recall | 0.50 | 0.50 |
Query 2: suite for web backend server development
Relevant: Java, JavaScript, PHP, Python (the answers are somewhat subjective)
| Rank | text-embedding-3-small | text-embedding-3-large |
|------|------------------------|------------------------|
| 1 | ✅ PHP | ✅ JavaScript |
| 2 | ✅ Java | ✅ Java |
| 3 | ✅ JavaScript | ✅ PHP |
| 4 | ❎ C# | ✅ Python |
| Precision | 0.75 | 1.0 |
| Recall | 0.75 | 1.0 |
In these two queries, we compared the two embedding models, text-embedding-3-small and text-embedding-3-large, by precision and recall.
We can use this as a starting point and increase the number of data objects in the dataset, as well as the number of queries, to evaluate the embedding models more effectively.
Summary
The key to selection is to follow these six steps:
Clarify business objectives and task types
Analyze data characteristics and domain requirements
Research available models and their licenses
Rigorous evaluation using test sets and benchmarks
Design the deployment and integration solution
Conduct end-to-end testing before launch
Remember: the most suitable model is not necessarily the one with the highest benchmark score, but the one that best fits your actual business needs and technical constraints.
In an era of rapidly iterating embedding models, it is also recommended to regularly revisit your current choice, keep an eye on new models and techniques, and switch promptly when a new solution promises significant benefits.