How to choose an Embedding Model? 10 thoughts on embedding models

Written by
Clara Bennett
Updated on: June 20, 2025
Recommendation

Explore the core role and selection strategy of embedding models in large model applications.

Core content:
1. The importance and function of embedding models in the RAG framework
2. The impact of context windows on model performance
3. The application of position embedding technology in processing long texts


In large model applications, especially those built on the RAG framework, the embedding model is an indispensable component. Here are 10 practical thoughts on embedding models that I hope you will find helpful.

1. Importance of Embedding Models in RAG

Embedding models convert text into numerical vectors, which allow computers to process, compare, and retrieve information more efficiently. These vectors can capture the meaningful connections between words, phrases, and even entire documents, making embedding models a key tool in a variety of natural language processing tasks.

In a Retrieval-Augmented Generation (RAG) system, the embedding model plays a central role: it finds and ranks the information in the knowledge base that is most relevant to the user's query. When a user asks a question, the embedding model compares text vectors to find the best-matching documents. Choosing the right embedding model is crucial to making retrieval results both accurate and meaningful, which in turn makes the generated answers more accurate and useful.
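
To make this concrete, here is a minimal sketch of embedding-based retrieval. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, both of which are illustrative assumptions rather than recommendations from this article:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed model; any sentence-embedding model with the same interface would work.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The statute of limitations for breach of contract is typically several years.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Case law on negligence requires establishing a duty of care.",
]
query = "How long do I have to sue for breach of contract?"

doc_vectors = model.encode(documents)   # one vector per document
query_vector = model.encode(query)      # one vector for the query

# Cosine similarity between the query and every document, highest first.
scores = util.cos_sim(query_vector, doc_vectors)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```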

For example, in a RAG system in the legal field, if an embedding model trained specifically for legal terminology is used, the system can better find legal documents relevant to the query and ensure that the cited case law materials are both accurate and contextual. This precision is especially important for work scenarios that require a high degree of professionalism and accuracy, such as legal research or medical literature analysis. In this way, the embedding model not only improves the quality of information retrieval, but also enhances the practicality of the entire system.

2. How context is handled in the embedding model structure

The context window refers to the maximum amount of text that the embedding model can process at one time, that is, the number of words or subwords it can consider. This parameter affects how much content the model can include when generating text representations.

A larger context window means that the model can process longer paragraphs without worrying about information being truncated. This is important for tasks that require understanding long documents, such as analyzing research papers, legal documents, or academic transcripts.

For example, when performing semantic search, if the model has a small context window, it may miss important information in the later part of the document. In contrast, a model with a larger context window can capture the broad meaning of the entire document, thereby providing more accurate search results.

In practice, different models support different context lengths. Some older models may be limited to processing only 512 tokens, but newer models are capable of processing thousands of tokens, making them well suited for complex tasks such as summarizing long articles or extracting information from detailed documents.
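
As a minimal sketch of why this matters, assuming a Hugging Face tokenizer and a 512-token limit (both assumed purely for illustration), text longer than the window has to be truncated or split into chunks before it is embedded:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer
max_tokens = 512  # typical limit of older embedding models

long_document = "..."  # stand-in for a document that may exceed the context window
token_ids = tokenizer.encode(long_document, add_special_tokens=False)

# Anything beyond max_tokens would be silently lost if we only truncated,
# so split the document into windows that each fit inside the model's limit.
chunks = [
    tokenizer.decode(token_ids[i : i + max_tokens])
    for i in range(0, len(token_ids), max_tokens)
]
print(f"{len(token_ids)} tokens -> {len(chunks)} chunk(s)")
```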

Unlike traditional recurrent neural networks (RNNs), Transformer-based embedding models (such as BERT) have no inherent sense of word order. To compensate, these models use positional embeddings to encode the position of each token:

  • Absolute position embeddings : each token in the sequence is assigned a representation of its specific position (the original Transformer uses fixed sinusoidal functions, while BERT learns an embedding for each position), so the model knows where each word sits in the sentence.

  • Relative position embedding : Instead of focusing on the specific position of words, it focuses on the relative distance between words (for example, the T5 model uses this approach). This approach helps to better understand the relationship between words instead of relying solely on the order in which they appear.

This accurate capture of word order is especially important for processing long texts, as it ensures that even words that are far apart in the document can be correctly understood and associated, which is crucial for improving the accuracy of text retrieval and document ranking. In this way, the model can not only understand the meaning of individual words, but also grasp the structure and logic of the entire document.
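
For reference, here is a minimal NumPy sketch of the fixed sinusoidal scheme from the original Transformer paper. BERT itself learns its position embeddings, so this is illustrative rather than BERT's exact mechanism:

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len: int, dim: int) -> np.ndarray:
    """Fixed absolute position embeddings: sine on even dims, cosine on odd dims."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(dim)[None, :]                          # (1, dim)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates                        # (seq_len, dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

print(sinusoidal_position_embeddings(seq_len=128, dim=768).shape)  # (128, 768)
```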


3. The impact of tokenization mechanism on embedding model

Tokenization is the process of breaking text into smaller units called tokens, which can be single words, parts of words, or even single characters. This is an important preprocessing step before embedding models process text, as it is directly related to how text is converted into numerical form.

Different Tokenization methods have a great impact on the effectiveness of embedding models in processing various texts:

  • Word-level Tokenization : This approach treats each word as a separate token. However, it has difficulty dealing with newly coined or rare words, as these words may not be in the model’s known vocabulary.

  • Subword Tokenization (such as Byte-Pair Encoding or WordPiece): This technique breaks words into smaller parts or subwords. For example, "unhappiness" might be split into "un", "happi", and "ness". This approach allows the model to better cope with out-of-vocabulary words, so it is very popular in modern models. It cleverly balances vocabulary size and flexibility, allowing the model to handle a wide range of words from everyday language to professional terms without making the vocabulary too large.

  • Character-level Tokenization : Here each character is treated as a separate token. This approach can be useful for languages with complex morphology or small character sets, although it makes the resulting sequences much longer.

Choosing the right tokenization method is crucial for the embedding model to effectively handle domain-specific languages, professional terms, or multilingual texts. For example, in healthcare applications, it is particularly important to use subword-level tokenization methods to ensure that the model accurately understands and processes professional terms such as "myocardial infarction". Correctly choosing a tokenization strategy can help improve model performance and make it more suitable for specific application scenarios.
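
A small sketch of what subword tokenization looks like in practice, assuming Hugging Face tokenizers; the exact splits depend on each model's learned vocabulary, so treat the outputs as illustrative:

```python
from transformers import AutoTokenizer

# Assumed general-purpose tokenizer; a domain-specific model's tokenizer would
# typically split specialised terms into fewer, more meaningful pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for text in ["unhappiness", "myocardial infarction"]:
    print(text, "->", tokenizer.tokenize(text))
# Rare or domain-specific words are broken into several subword pieces,
# while common words usually stay as a single token.
```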

4. Impact of Embedding Model Dimension on Performance

Dimensionality refers to the number of values the model generates for each embedding, which determines how much information these vectors can contain.

  • Low-dimensional embeddings (such as 128 or 256 dimensions) are computationally efficient and fast to process, but may not be as detailed in expressing semantics as high-dimensional embeddings, which may affect accuracy in some tasks. They are suitable for scenarios that require high speed and efficiency.

  • High-dimensional embeddings (such as 768 or 1024 dimensions) can capture more subtle semantic relationships and provide more powerful expressive power. However, they require more computing resources and memory support, which means higher costs and slower processing speeds. High-dimensional embeddings can express the meaning of text more finely, but they are challenging to use with limited resources.

Extremely high-dimensional embeddings (more than 1024 dimensions) provide very rich semantic representations, but they also bring some problems:

  • Increased computational cost : storing and processing these high-dimensional vectors requires more memory and more compute.
  • The curse of dimensionality : as dimensionality increases, similarity comparisons in high-dimensional space become less informative, because the distances between points become harder to distinguish.
  • Slower retrieval times : without optimization, searching a large embedding database can become quite time-consuming.

To combat these issues, there are several mitigation strategies:

  • Using dimensionality reduction techniques, such as PCA (Principal Component Analysis) or t-SNE, can reduce the computational burden while retaining key information.
  • Using efficient vector search techniques, such as FAISS (Facebook AI Similarity Search) or HNSW (Hierarchical Navigable Small World) indexes, can significantly speed up retrieval, as sketched below.
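
As a rough sketch of the second strategy, assuming the faiss package and 384-dimensional embeddings (both assumptions made for illustration):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384                                                    # assumed embedding dimensionality
doc_vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in document embeddings
query_vector = np.random.rand(1, dim).astype("float32")      # stand-in query embedding

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(doc_vectors)
faiss.normalize_L2(query_vector)

index = faiss.IndexFlatIP(dim)   # exact search; HNSW and friends trade accuracy for speed
index.add(doc_vectors)

scores, ids = index.search(query_vector, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```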

Choosing the right dimension depends on the specific application requirements. For real-time applications, such as chatbots or voice assistants, low-dimensional embeddings are usually a better choice, because such scenarios value speed and efficiency more. For tasks that require high precision, such as document similarity analysis, higher-dimensional embeddings are more suitable because they ensure a more accurate description of complex text content. In this way, by weighing the pros and cons of different dimensions, the most suitable solution can be found according to actual needs.

5. The impact of vocabulary size on embedding models

The vocabulary size of an embedding model determines the number of unique words or tokens it can recognize and process. A larger vocabulary can improve the accuracy of the model because it can understand a wider range of words, including domain-specific terms and expressions in multiple languages. However, it also means that more memory and computing resources are required.

  • Advantages of a large vocabulary : Having a large vocabulary allows the model to better understand and represent a variety of words, especially those domain-specific terms or words from different languages. This is especially important for application scenarios such as scientific research or multilingual literature retrieval, because they often need to deal with a large amount of professional terminology or cross-lingual information.

  • Small vocabulary : If the vocabulary is small, it can reduce the required memory and speed up the processing. However, this may cause the model to perform poorly when encountering uncommon words or domain-specific terms.

For example, in natural language processing models in the biomedical field, a larger vocabulary is essential to accurately understand and use medical terms. On the other hand, for customer service chatbots, since they mainly deal with common questions in daily conversations, a smaller vocabulary is sufficient and can also ensure the speed and efficiency of the response.
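
One way to see the memory side of this trade-off, assuming a Hugging Face tokenizer and a 768-dimensional hidden size (both illustrative assumptions): the token-embedding table alone scales with vocabulary size times dimensionality.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
vocab_size = tokenizer.vocab_size
dim = 768  # assumed embedding dimensionality

# The token-embedding lookup table alone holds vocab_size x dim parameters.
params = vocab_size * dim
print(f"{vocab_size} tokens x {dim} dims = {params:,} parameters "
      f"(~{params * 4 / 1e6:.0f} MB at float32)")
```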

In summary, when your application covers a wide range of topics, multiple languages, or contains a lot of specialized terms, it is more advantageous to choose a larger vocabulary. But be aware that doing so will also increase memory requirements, which may become a challenge when resources are limited. Therefore, when choosing a model, you need to balance the relationship between vocabulary size and resource constraints based on the specific usage scenario.

6. The impact of training data on embedding models

The training data used when developing an embedding model has a significant impact on its performance because it determines what kind of language and knowledge domains the model can understand.

If a model is trained on broad, general internet material (e.g., Wikipedia, news articles), it may perform well in everyday conversations, but may not perform well in specialized domains such as finance, law, or medicine. In contrast, if a model is trained on a domain-specific dataset, such as medical journals for healthcare applications, it will perform better in that specific domain.

The quality and diversity of training data are critical to the performance of the model. High-quality and diverse training data can significantly improve the model's knowledge and processing capabilities.

Fine-tuning on domain-specific data can enhance the embedding model’s understanding of specialized terminology and contextual nuances. Benefits include:

  • Improved search accuracy : the model can more accurately find documents that match the query intent.
  • Better grasp of terminology : it learns domain-specific terms that are not adequately represented in general-purpose models.
  • Reduced bias : fine-tuning can reduce various biases present in general-purpose models.

For example, a legal document retrieval system benefits from a model fine-tuned on case law and regulations, which helps ensure that search results are legally relevant rather than broad, general information.
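
A minimal fine-tuning sketch, assuming the sentence-transformers training API, a base model such as all-MiniLM-L6-v2, and a couple of hypothetical in-domain (query, relevant passage) pairs; in practice you would need far more data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# Hypothetical in-domain pairs; real training sets contain thousands of them.
train_examples = [
    InputExample(texts=["statute of limitations for contracts",
                        "Claims for breach of contract must be filed within the limitation period."]),
    InputExample(texts=["duty of care in negligence",
                        "Negligence requires showing that the defendant owed the claimant a duty of care."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# Treats the other passages in each batch as negatives, so only positive pairs are needed.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("legal-embeddings")  # hypothetical output path
```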

Therefore, when choosing an embedding model, it is important to consider whether its training data matches the intended application scenario. This not only improves work efficiency, but also ensures the relevance and accuracy of the retrieved content.

7. Cost and deployment options for embedding models

There are several cost factors to consider when picking an embedding model:

  1. Infrastructure costs: Running the embedding model requires computing resources such as GPUs or cloud servers. The cost depends on the hardware configuration you choose and the duration of use.

  2. API costs: Some commercial models, such as those provided by OpenAI and Cohere, charge based on the number of tokens processed, which means that as usage grows, the cost grows with it (see the rough estimate after this list).

  3. Storage and memory costs: Large, high-dimensional embedding models require more storage space and memory to run, which places higher demands on resources and naturally increases costs.

  4. Inference costs: Running inference over a large dataset, especially when the embeddings need to be updated frequently, can be quite expensive.
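
As a back-of-the-envelope sketch of point 2, with entirely hypothetical prices and volumes (check your provider's actual pricing):

```python
# Hypothetical figures for illustration only.
price_per_million_tokens = 0.10   # USD, assumed API price
avg_tokens_per_document = 500
documents_to_embed = 2_000_000
queries_per_month = 5_000_000
avg_tokens_per_query = 20

one_off_indexing_cost = documents_to_embed * avg_tokens_per_document / 1e6 * price_per_million_tokens
monthly_query_cost = queries_per_month * avg_tokens_per_query / 1e6 * price_per_million_tokens

print(f"Indexing: ${one_off_indexing_cost:,.2f}  |  Queries: ${monthly_query_cost:,.2f}/month")
```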

For example, a startup building a search engine may prefer an open-source embedding model to reduce API costs. In contrast, a large enterprise with abundant computing resources may prefer a proprietary model with superior performance despite the higher price, because it values accuracy and efficiency over cost.

API-based models are convenient and quick to adopt, but in the long run, especially for heavy-usage applications, the cost can become very high. Open-source models are more affordable, but they require more technical expertise and you need to build and maintain the relevant infrastructure yourself. The choice should therefore depend not only on budget constraints but also on your own technical capabilities and actual needs.

8. Performance Evaluation Metrics for Embedding Models

The quality of the embedding models is evaluated using a variety of benchmarks and testing methods:

  1. MTEB (Massive Text Embedding Benchmark): a very popular evaluation framework for testing the performance of embedding models on different natural language processing tasks, such as semantic search, classification, and clustering. A higher score generally means that the model performs better on these tasks.

  2. Intrinsic evaluation: This method tests whether the embeddings can accurately capture the meaning of words through tasks such as word similarity.

  3. Extrinsic evaluation: focuses on the performance of the embedding model in practical applications, such as search ranking, recommendation systems, and question-answering tasks.

  4. Custom testing: This means running a test on your own dataset to ensure it meets your specific needs. For example, a law firm focused on legal literature retrieval needs to evaluate the model’s ability to accurately retrieve information based on case law; while an e-commerce company optimizing product recommendations is more concerned with how the embedding model affects customer engagement.

In addition, cosine similarity is a common measure of how alike two vectors are: it is the cosine of the angle between them. In embedding models, this metric is used to judge whether two texts are semantically similar. Cosine similarity ranges from -1 to 1:

  • 1 means the two vectors point in the same direction, i.e. they are highly similar;
  • 0 means the two vectors are perpendicular, i.e. there is no similarity;
  • -1 means the two vectors point in opposite directions.

(Cosine distance is simply 1 minus the cosine similarity.)

Cosine similarity is widely used in semantic search and in the document retrieval stage of RAG systems to rank documents by how close they are to the query, which makes it efficient to find the documents most relevant to the query.
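
A minimal NumPy sketch of the computation described above, using two small made-up vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; ranges from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.7, 0.1])
doc_vec = np.array([0.25, 0.6, 0.05])

sim = cosine_similarity(query_vec, doc_vec)
print(f"similarity = {sim:.3f}, distance = {1 - sim:.3f}")  # distance = 1 - similarity
```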


9. Applicable scenarios of different embedding types

Static embeddings are like attaching a fixed label to each word, no matter how the word is used in different sentences or paragraphs. Tools such as Word2Vec, GloVe, and FastText work this way. Although this method captures relationships between words, it cannot distinguish the different meanings of the same word in different contexts: the word "bank", for example, gets the same vector whether it refers to a river bank or to a financial institution.

Contextual word embeddings are smarter, such as BERT, RoBERTa, and Sentence Transformers, which dynamically generate representations based on the text around the word, allowing them to understand the multiple meanings of a word in different scenarios. This makes such models perform better in tasks such as RAG retrieval, semantic search, and text summarization.

Dense embeddings are generated by models like BERT, SBERT, and GPT, which transform each word into a compact and fixed-length vector (for example, 768 or 1024 dimensions). This representation is very good at capturing the semantic connections between words and is suitable for tasks that require a deep understanding of the meaning of the text, such as semantic search and similarity ranking in RAG.

In contrast, sparse embeddings use traditional techniques such as TF-IDF or BM25, which produce very high-dimensional vectors that are mostly zeros. Although this may seem wasteful of space, it is very effective for precise keyword retrieval systems such as search engines and traditional literature retrieval.

Some of today’s advanced RAG processes also combine the advantages of dense and sparse embeddings to form a so-called hybrid search method, which not only ensures that the found content matches the keywords, but also ensures that there is a deeper semantic connection between the content, thereby improving the overall retrieval accuracy.
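
A rough sketch of hybrid scoring, assuming the rank_bm25 and sentence-transformers packages and a simple weighted fusion of the two score lists; the 0.5 weight is an arbitrary assumption, and real systems often use reciprocal rank fusion or a re-ranker instead:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

documents = [
    "The court applied the statute of limitations to dismiss the claim.",
    "Limitation periods for contract claims vary by jurisdiction.",
    "Photosynthesis converts light into chemical energy.",
]
query = "statute of limitations for contract claims"

# Sparse side: exact keyword matching with BM25.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
sparse_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense side: semantic similarity from an (assumed) embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(documents))[0].numpy()

# Normalise each score list to [0, 1] and blend them.
def normalize(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * normalize(sparse_scores) + 0.5 * normalize(dense_scores)
for idx in hybrid.argsort()[::-1]:
    print(f"{hybrid[idx]:.3f}  {documents[idx]}")
```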

10. Embedding model selection and RAG practice

There are several key factors to consider when choosing an embedding model: whether it can handle your document lengths (the context window size); the cost per token; the quality of the model (for example, as measured by MTEB scores or other benchmark performance); the balance between semantic richness and computational efficiency (the choice of dimensionality); and the tokenization scheme. It is also very important to test the model's performance on your specific datasets and application scenarios.

The workflow of a RAG system using an embedding model is roughly as follows:

  1. Preprocessing: The input text is split into tokens, and these tokens are converted into vector embeddings using a pre-trained model (such as BERT or a Sentence Transformer).

  2. Indexing: The generated embeddings are saved into a dedicated vector database, such as FAISS, Pinecone, or Weaviate.

  3. Retrieval: When a query comes in, the system generates an embedding for the query and then uses cosine similarity or approximate nearest-neighbor search to find the top-k most similar documents.

  4. Ranking: The retrieved documents are re-ranked according to additional criteria, such as a BM25 score or the output of a cross-encoder.

  5. Augmentation: The selected documents are fed to a large language model (LLM), such as OpenAI's GPT series, Claude, or Mistral, which generates an answer grounded in the facts and background information those documents provide.

This entire process ensures that the final answer is both accurate and relevant.
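
A condensed sketch of steps 1-5, assuming sentence-transformers for embeddings and FAISS as the vector store; `call_llm` is a purely hypothetical placeholder for whatever LLM API you use:

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM provider's API call (hypothetical)."""
    raise NotImplementedError

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
chunks = ["chunk one of the knowledge base...", "chunk two...", "chunk three..."]

# 1-2. Preprocessing and indexing: embed the chunks and store them in a vector index.
vectors = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

def answer(question: str, k: int = 2) -> str:
    # 3. Retrieval: embed the query and fetch the k nearest chunks.
    q = np.asarray(model.encode([question], normalize_embeddings=True), dtype="float32")
    _, ids = index.search(q, k)
    # 4. Ranking: here we keep FAISS order; a cross-encoder or BM25 could re-rank.
    context = "\n".join(chunks[i] for i in ids[0])
    # 5. Augmentation: hand the retrieved context to the LLM (hypothetical call).
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```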