Embedding models, the key to RAG: a showdown between domestic and foreign contenders

Written by Silas Grey · Updated on: July 3, 2025

Explore the competition between domestic and foreign vendors in the key embedding layer, and find the model that suits you best.

Core content:
1. The role embedding models play in supporting large models
2. A performance comparison of mainstream embedding models at home and abroad
3. Hands-on measurement of each model's performance in a specific scenario


Although the context windows of large models keep growing, whether because the knowledge base is large or for security reasons, we still want to hand the model only the appropriate context. Choosing a suitable embedding model is crucial to that. How do we find a better one? I hope this article offers some reference.

Background

We shouldn't use technology for technology's sake; it's better to explore it to solve a concrete problem. Why did I want to learn about embeddings? Because MCP has been quite popular recently, and in my work I often need to analyze the commit history of repositories to trace when certain content was introduced or modified. In other words, I want to talk to the git history. Traditional methods can sometimes do this, but writing an MCP server lets me ask in natural language things like:

  • Who is responsible for XX gameplay?
  • What content has been developed recently?
  • What are the main features being developed in the past month?
  • What features shipped in March this year?
  • What are the recent changes to a file?

Getting answers to these questions would be great. This information can be retrieved starting from `git log`, but our large project is made up of dozens of small repositories, which raises the difficulty a level. The complete solution is nearly finished; today I'll talk about how I tackled the first challenge: embedding! I set up a PK tournament to see which model suits my scenario best.

A disclaimer first: I don't work in the AI field. This is a hobby project testing one special scenario, and given the limits of my personal knowledge there may be misunderstandings. Corrections are welcome.

Introduction to the domestic and foreign models

What is embedding? The description on Wikipedia is rather abstract, so here is Tencent Hunyuan T1's explanation:

The Embedding model is a technology that maps high-dimensional data (such as text and images) to a low-dimensional vector space. It achieves efficient computing and similarity analysis by retaining the semantics and feature information of the original data. Its core principle is to map similar data points to similar positions in the vector space through neural network training. For example, the vectors of "cat" and "dog" are closer than those of "cat" and "apple", thereby capturing semantic associations.
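To make that intuition concrete, here is a minimal sketch of how cosine similarity captures it. It assumes the `sentence-transformers` package; the model name is only an illustrative small multilingual model, not one of the contestants below:

```python
# Minimal sketch: semantic closeness via cosine similarity.
# Assumes the sentence-transformers package is installed; the model
# here is only an illustrative choice, not one of the tested models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
cat, dog, apple = model.encode(["cat", "dog", "apple"])

print(util.cos_sim(cat, dog))    # expected: higher (semantically close)
print(util.cos_sim(cat, apple))  # expected: lower (semantically distant)
```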

There is a leaderboard on Hugging Face [1] where you can view the performance of different models. It is useful for getting a sense of which models are good, but in practice your own testing may be more reliable.

I plan to select some free, open-source models, and also test some closed-source ones to see how much better they are and whether they are worth paying for. The test scenario has the following steps:

  1. Have AI generate a batch of git commit messages.
  2. Hand these messages to each embedding model under test for vectorization.
  3. Enter a query and run a (cosine) similarity search to get the top 5 commit messages (a minimal sketch of steps 2 and 3 follows this list).
  4. Have AI score each embedding model (a bit of repetitive work; let's see how AI does) and judge the quality of the retrieval.
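Steps 2 and 3 boil down to very little code. Here is a minimal sketch of the idea, with a hypothetical `embed` callable standing in for whichever model is under test:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: str, messages: list[str], embed, k: int = 5):
    """Step 2: vectorize the corpus; step 3: vectorize the query and
    return the k most similar commit messages. `embed` is a hypothetical
    text -> vector function for the model under test."""
    message_vecs = [np.asarray(embed(m)) for m in messages]
    query_vec = np.asarray(embed(query))
    scores = [cosine_similarity(query_vec, v) for v in message_vecs]
    ranked = sorted(zip(messages, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]
```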

The actual tests held some unexpected results, so let's get to it.

Introduction to open-source embedding models

After checking some information online, I selected the following most-recommended models for testing.

| Model Name | Description | Dimensions | Max tokens | Supported languages |
| --- | --- | --- | --- | --- |
| text-embedding-gte-large-zh | GTE large Chinese embedding model (local) | 1024 | 512 | Chinese |
| text-embedding-bge-large-zh-v1.5 | BAAI's open-source Chinese-English embedding model (local) | 1024 | 512 | Chinese, English |
| text-embedding-m3e-base | M3E base embedding model (local) | 768 | 512 | Chinese, English |
| text-embedding-granite-embedding-278m-multilingual | Granite multilingual embedding model (local) | 768 | 512 | Multilingual (English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, Chinese, etc.) |
| text-embedding-multilingual-e5-large-instruct | E5 large multilingual embedding model | 1024 | 512 | Multilingual |

I originally wanted to enter the `jina-embeddings` series as well, but unfortunately they are not well supported in LM Studio. Maybe the timing isn't right, so I'll skip them for now. If any of you have experience with them, please leave a message and share how they perform.

Introduction to closed-source embedding models

OpenAI with its `text-embedding-3` series, as well as domestic giants such as BAT and ByteDance, all have their own embedding models and are eligible to enter. This draws on the model providers I mentioned before in OneAPI [2]: as long as a provider has an embedding model, it is welcome to join the group PK.

| Model Name | Description | Dimensions | Max tokens | Supported languages |
| --- | --- | --- | --- | --- |
| text-embedding-3-large | OpenAI's third-generation large embedding model | 3072 | 8191 | Multilingual |
| hunyuan-embedding | Tencent Hunyuan embedding model | 1024 | 1024 | Chinese, English |
| doubao-embedding-large-text-240915 | Doubao embedding model | 1024 | 4096 | Chinese, English |
| Baichuan-Text-Embedding | Baichuan embedding model | 1024 | 512 | Chinese, English |
| text-embedding-v3 | Tongyi Qianwen embedding model | 1024 | 8192 | Chinese, English |
| Embedding-V1 | Baidu embedding model | 1024 | 384 | Chinese, English |

Notice that although the paid models haven't started competing yet, they are already ahead on the three paper dimensions (dimensionality, maximum tokens, supported languages) :) As expected: without something special, who would dare to charge? Uh, Baidu, Baichuan, what happened to you two?

Embedding Arena

To keep things fair and open, the entire PK process is recorded in the GitHub repository: https://github.com/kevin1sMe/embedding-selector [3]. Everyone is welcome to watch.

Let’s publish the test questions first. I asked AI to generate the following test data and query questions:

"""
Test dataset: Chinese and mixed Chinese and English commit messages
"""


# Various styles of commit messages as test data
COMMIT_MESSAGES = [
    # Plain Chinese commit messages
    "Fix the slow loading speed of the home page" ,
    "Optimize user login process" ,
    "New data export function" ,
    "Fixed the crash issue reported by users" ,
    "Update Documentation" ,
    "Refactored the code structure to improve maintainability" ,
    "Removed deprecated API calls" ,
    "Add unit test case" ,
    "Changed the default settings in the configuration file" ,
    "Solved compatibility issues on iOS devices" ,
    
    # Mixed Chinese and English commit messages
    "fix: Fixed the bug of login page" ,
    "feat: added new payment interface" ,
    "docs: Update API documentation" ,
    "refactor: Refactor the user authentication module" ,
    "test: Added test for checkout process" ,
    "style: Adjusted the style of UI components" ,
    "perf: Optimized database query performance" ,
    "chore: Updated package dependencies" ,
    "fix(ui): modal component close button failure issue" ,
    "feat(api): Added user data synchronization endpoint" ,
    
    # Commit messages mixed with technical terminology
    "Fixed Redis connection pool leak issue" ,
    "Optimizing rendering performance of React components" ,
    "Added Elasticsearch index management function" ,
    "Refactor JWT authentication logic to improve security" ,
    "Solved the problem of high memory usage of Docker containers" ,
    "Add GraphQL query caching mechanism" ,
    "Updated Webpack configuration to improve build speed" ,
    "Fixed the data inconsistency problem caused by multi-thread concurrency" ,
    "Added heartbeat detection for WebSocket connections" ,
    "Optimized the execution efficiency of MongoDB aggregation queries" ,
    
    # Commit messages related to team collaboration
    "Modify code based on Code Review feedback" ,
    "Merge the latest changes from the develop branch" ,
    "Prepare for v2.0.0 release" ,
    "Fixed regression issues reported by QA team" ,
    "The new requirements raised by the product manager have been realized" ,
    "Provisional submission, WIP: User management module" ,
    "Cooperate with the backend API to adjust the corresponding frontend code" ,
    "Update component style according to UI design draft" ,
    "feature flag that adds new functionality" ,
    "Resolve merge conflicts, keeping both changes" ,
]

# Query statements for testing
TEST_QUERIES = [
    # Function-related queries
    "How to fix a bug" ,
    "Add new features" ,
    "Update Document" ,
    "Optimize performance" ,
    "Refactoring Code" ,
    
    # Technology related queries
    "About React component submission" ,
    "Database Optimization" ,
    "API Development" ,
    "UI interface adjustment" ,
    "Docker related issues" ,
    
    # Process related queries
    "Modifications after code review" ,
    "Release Preparation" ,
    "Fix problems found in testing" ,
    "Merge branches" ,
    "Conflict Resolution"

Open-Source Division

For the open-source models, I deployed them locally with LM Studio, choosing the latest versions available (2025-03-29).
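One convenience worth noting: LM Studio exposes an OpenAI-compatible local server, so the local models can be called the same way as the paid APIs. A sketch, assuming the default endpoint `http://localhost:1234/v1` and the `openai` Python package:

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI embeddings protocol;
# the api_key is a placeholder, since no real key is required locally.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.embeddings.create(
    model="text-embedding-bge-large-zh-v1.5",
    input=["Fix the slow loading speed of the home page"],
)
print(len(resp.data[0].embedding))  # expect 1024 dimensions for this model
```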

Let's call the five contestants in this round the F5. Let the competition begin!

```bash
python3 scripts/run_test.py -m \
    text-embedding-m3e-base \
    text-embedding-bge-large-zh-v1.5 \
    text-embedding-gte-large-zh \
    text-embedding-granite-embedding-278m-multilingual \
    text-embedding-multilingual-e5-large-instruct \
    -o results/open-source-f5.json
```

Although these run as local deployments, they handle this amount of computation in seconds. Here are their timings:

| Model | Processing time (s) | Data volume |
| --- | --- | --- |
| text-embedding-m3e-base | 0.70 | 40 |
| text-embedding-bge-large-zh-v1.5 | 1.18 | 40 |
| text-embedding-gte-large-zh | 1.12 | 40 |
| text-embedding-granite-embedding-278m-multilingual | 0.68 | 40 |
| text-embedding-multilingual-e5-large-instruct | 1.23 | 40 |

Let's check the output in `open-source-f5.json`:

```json
[
  {
    "model_name": "text-embedding-m3e-base",
    "precision@1": 0.0,
    "precision@3": 0.0,
    "precision@5": 0.0,
    "processing_time": 0.6969938278198242,
    "query_results": [
      {
        "query": "How to fix the bug",
        "top_results": [
          {
            "rank": 1,
            "message": "fix: Fixed the bug in the login page",
            "score": 0.837168656326853
          },
          {
            "rank": 2,
            "message": "Fixed the crash issue reported by users",
            "score": 0.8329215028808162
          },
          {
            "rank": 3,
            "message": "Modify code according to Code Review feedback",
            "score": 0.8251477839600121
          }
          // ... subsequent lines omitted
```
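A note on the `precision@k` fields: they measure what fraction of the top-k results are truly relevant, which requires a hand-labeled relevant set per query (the zeros above are presumably because no labels were supplied for this run). A small sketch of the metric, with hypothetical labels:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved messages that are labeled relevant."""
    return sum(1 for m in retrieved[:k] if m in relevant) / k

# Hypothetical relevance labels for the query "How to fix the bug":
relevant = {"fix: Fixed the bug in the login page",
            "Fixed the crash issue reported by users"}
retrieved = ["fix: Fixed the bug in the login page",
             "Fixed the crash issue reported by users",
             "Modify code according to Code Review feedback"]
print(precision_at_k(retrieved, relevant, k=3))  # 2/3 ≈ 0.67
```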

That is a lot of output and hard to eyeball, so we first let AI evaluate and score it. I chose Google's new model, currently regarded as the strongest on paper, Gemini 2.5 Pro Experimental 03-25, to do the rating. Let's see how it does.

Model search result comparison table

Note: due to limited space, only part of the content is shown here. For the full content, see the code repository.

| Query | text-embedding-m3e-base | text-embedding-bge-large-zh-v1.5 | text-embedding-gte-large-zh | text-embedding-granite-embedding-278m-multilingual | text-embedding-multilingual-e5-large-instruct |
| --- | --- | --- | --- | --- | --- |
| How to fix the bug | 1. fix: Fixed the bug of the login page (0.837)<br>2. Fixed the crash problem reported by users (0.833)<br>3. Modified the code according to the Code Review feedback (0.825)<br>4. Fixed the data inconsistency problem caused by multi-threaded concurrency (0.807)<br>5. Fixed the regression problem reported by the QA team (0.791) | 1. Fixed the crash problem reported by users (0.599)<br>2. fix: fixed the bug of the login page (0.581)<br>3. Fixed the data inconsistency problem caused by multi-threaded concurrency (0.576)<br>4. Modified the code according to the Code Review feedback (0.541)<br>5. Fixed the Redis connection pool leakage problem (0.532) | 1. fix: fixed the bug of the login page (0.623)<br>2. Fixed the crash problem reported by users (0.608)<br>3. Fixed the slow loading speed of the home page (0.592)<br>4. Fixed the Redis connection pool leakage problem (0.555)<br>5. Collaborated with the backend API to adjust the corresponding front-end code (0.527) | 1. fix: Fixed the bug of the login page (0.770)<br>2. Fixed the crash problem reported by users (0.724)<br>3. Fixed the Redis connection pool leak problem (0.688)<br>4. Fixed the slow loading speed of the home page (0.687)<br>5. Fixed the regression problem reported by the QA team (0.682) | 1. Fixed the regression problem reported by the QA team (0.918)<br>2. Fixed the crash problem reported by users (0.916)<br>3. fix: fixed the bug of the login page (0.914)<br>4. Modified the code according to the Code Review feedback (0.907)<br>5. Refactored the code structure to improve maintainability (0.895) |
| Adding new features | 1. Added data export function (0.859)<br>2. Added feature flag for new functions (0.845)<br>3. feat: Added new payment interface (0.822)<br>4. Added Elasticsearch index management function (0.815)<br>5. Updated Webpack configuration to improve build speed (0.812) | 1. Added data export function (0.710)<br>2. Added feature flag of new function (0.653)<br>3. feat: Added new payment interface (0.637)<br>4. Implemented the new requirements proposed by the product manager (0.631)<br>5. Optimized the user login process (0.625) | 1. Added data export function (0.627)<br>2. Added feature flag of new function (0.602)<br>3. Implemented new requirements proposed by product manager (0.548)<br>4. feat: Added new payment interface (0.524)<br>5. Updated component style according to UI design draft (0.511) | 1. Added feature flag for new features (0.875)<br>2. Added data export feature (0.804)<br>3. feat: Added new payment interface (0.792)<br>4. Added Elasticsearch index management function (0.702)<br>5. Implemented new requirements proposed by product managers (0.687) | 1. Added feature flag for new features (0.954)<br>2. Added data export feature (0.944)<br>3. Implemented new requirements proposed by product manager (0.933)<br>4. Updated documentation (0.931)<br>5. Merged the latest changes in develop branch (0.924) |
| Updated documentation | 1. Updated documentation (0.957)<br>2. docs: Updated API documentation (0.888)<br>3. chore: Updated package dependencies (0.791)<br>4. Updated Webpack configuration to improve build speed (0.785)<br>5. Merged the latest changes from the develop branch (0.774) | 1. Updated documentation (0.857)<br>2. docs: Updated API documentation (0.772)<br>3. Added data export function (0.580)<br>4. Updated component style according to UI design draft (0.577)<br>5. Modified the default settings in the configuration file (0.558) | 1. Updated documentation (0.871)<br>2. docs: Updated API documentation (0.791)<br>3. Merged the latest changes from the develop branch (0.586)<br>4. Added data export function (0.582)<br>5. Added feature flag for new functions (0.541) | 1. Updated documentation (0.930)<br>2. docs: Updated API documentation (0.804)<br>3. chore: Updated package dependencies (0.691)<br>4. Merged the latest changes in the develop branch (0.667)<br>5. Prepared for v2.0.0 release (0.653) | 1. Updated documentation (0.980)<br>2. docs: Updated API documentation (0.953)<br>3. Added data export function (0.920)<br>4. Prepared for v2.0.0 release (0.919)<br>5. Merged the latest changes in the develop branch (0.914) |
| Optimizing performance | 1. Optimized the rendering performance of React components (0.841)<br>2. perf: Optimized the database query performance (0.817)<br>3. Modified the default settings in the configuration file (0.800)<br>4. Solved the problem of high memory usage of Docker containers (0.798)<br>5. Fixed the problem of slow loading of the home page (0.794) | 1. Optimized the rendering performance of React components (0.632)<br>2. perf: Optimized the database query performance (0.595)<br>3. Optimized the user login process (0.586)<br>4. Refactored the code structure to improve maintainability (0.564)<br>5. Fixed the data inconsistency problem caused by multi-threaded concurrency (0.554) | 1. Optimized the rendering performance of React components (0.645)<br>2. perf: Optimized the database query performance (0.611)<br>3. Updated the Webpack configuration to increase the build speed (0.581)<br>4. Fixed the crash problem reported by users (0.572)<br>5. Resolved the compatibility issue on iOS devices (0.567) | 1. perf: Optimized database query performance (0.726)<br>2. Optimized the rendering performance of React components (0.719)<br>3. Optimized the user login process (0.684)<br>4. Fixed the problem of slow home page loading speed (0.644)<br>5. Updated documentation (0.631) | 1. perf: Optimized database query performance (0.931)<br>2. Optimized the rendering performance of React components (0.925)<br>3. Optimized the user login process (0.913)<br>4. Fixed the problem of slow home page loading (0.907)<br>5. Optimized the execution efficiency of MongoDB aggregation query (0.905) |

Evaluation and analysis

From the table above, we can see the performance of different models under different query statements. Overall:

  • text-embedding-multilingual-e5-large-instruct gives relatively high scores under all query sentences, and the correlation of the results is also relatively high. This shows that the model performs well in understanding Chinese and mixed Chinese and English commit messages and query intent.
  • text-embedding-m3e-base can also give high scores in many queries, but the relevance of some results is problematic.
  • The overall performance of text-embedding-granite-embedding-278m-multilingual is relatively balanced, but its scores are generally lower than those of text-embedding-multilingual-e5-large-instruct, which may mean it is slightly weaker in depth of semantic understanding.
  • The generally low scores for text-embedding-bge-large-zh-v1.5 and text-embedding-gte-large-zh may indicate that they are more suitable for specific types of tasks or that they do not perform well when processing mixed-language and specialized terminology data such as commit messages.

Judgment basis

  1. Relevance: Are the results returned by the model highly relevant to the intent of the query? For example, for the query "how to fix a bug", the results returned should focus on commit messages related to bug fixes.
  2. Accuracy: Whether the results returned by the model truly reflect the content of the commit message.
  3. Sorting: whether the results with high relevance are ranked first, that is, to examine the sorting ability of the model.
  4. Score: Although the score is not a perfect representation of the quality of the model, in general, the higher the score, the more confident the model is in the results.
  5. Stability of overall performance: whether the model can maintain good performance under different types of query statements.

Conclusion

Based on the analysis above, text-embedding-multilingual-e5-large-instruct is the most suitable model for our commit-message retrieval task. It has higher scores and higher result relevance, indicating it better understands query intent and returns more accurate, useful results. It leads in retrieval accuracy, coverage, and stability, and is well up to tasks like commit-message retrieval. Other models may do well on certain specific queries, but overall text-embedding-multilingual-e5-large-instruct is more stable and reliable across all query types.

Let's also look at the sizes of the models above. text-embedding-multilingual-e5-large is noticeably larger than the others, which may be part of the reason, or it may simply be larger because it supports many languages. Before this model entered the arena, though, the other four quietly competed among themselves, and the smallest of them, text-embedding-m3e-base, took the championship. So it isn't always "the bigger, the stronger" :)

Closed-Source Division

We followed the same testing method and took these paid models out for a spin. First we let the domestic models compete among themselves; then we'll see how obvious the gap to the foreign models is. The domestic models that "qualified" were listed above, so let's go straight to the results (due to space limits only an excerpt is shown; the full content is in the code repository):

| Query | hunyuan | baidu | qwen | doubao | Baichuan |
| --- | --- | --- | --- | --- | --- |
| How to fix the bug | 1. Fixed the crash problem reported by users<br>2. Fixed the slow loading speed of the home page<br>3. fix: Fixed the bug of the login page<br>4. Fixed the data inconsistency problem caused by multi-threaded concurrency<br>5. Fixed the Redis connection pool leakage problem | 1. fix: fixed the bug of login page<br>2. Fixed the crash problem reported by users<br>3. Fixed the data inconsistency problem caused by multi-threaded concurrency<br>4. Fixed the problem of slow loading speed of home page<br>5. Fixed the problem of Redis connection pool leakage | 1. Fixed the crash problem reported by users<br>2. fix: fixed the bug of the login page<br>3. Fixed the Redis connection pool leak problem<br>4. Fixed the regression problem reported by the QA team<br>5. Modified the code according to the Code Review feedback | 1. Fixed the regression problem reported by the QA team<br>2. fix: Fixed the bug in the login page<br>3. Fixed the crash problem reported by users<br>4. Modified the code according to the Code Review feedback<br>5. Fixed the data inconsistency problem caused by multi-threaded concurrency | 1. Fixed the crash problem reported by users<br>2. fix: Fixed the bug of the login page<br>3. Fixed the data inconsistency problem caused by multi-threaded concurrency<br>4. Fixed the problem of slow loading of the home page<br>5. Modified the code according to the Code Review feedback |
| Adding new features | 1. Added data export function<br>2. Implemented new requirements proposed by product manager<br>3. Added feature flag of new function<br>4. feat: added new payment interface<br>5. chore: updated package dependencies | 1. Added data export function<br>2. Added feature flag of new function<br>3. Added Elasticsearch index management function<br>4. Updated documentation<br>5. Implemented new requirements proposed by product manager | 1. Added feature flag for new features<br>2. Added data export feature<br>3. feat(api): Added user data synchronization endpoint<br>4. Updated documentation<br>5. Implemented new requirements proposed by product managers | 1. Updated documentation<br>2. Added feature flags for new features<br>3. Optimized user login process<br>4. Deleted obsolete API calls<br>5. Restructured code to improve maintainability | 1. Added data export function<br>2. Added feature flag of new function<br>3. Optimized user login process<br>4. Implemented new requirements proposed by product manager<br>5. feat: added new payment interface |
| Updated documentation | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. chore: Updated package dependencies<br>4. Updated Webpack configuration to improve build speed<br>5. Modified code based on Code Review feedback | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Updated component styles based on UI design draft<br>4. chore: Updated package dependencies<br>5. Added data export function | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Added data export function<br>4. chore: Updated package dependencies<br>5. Updated component styles according to UI design draft | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Deleted deprecated API calls<br>4. Optimized user login process<br>5. Refactored code structure to improve maintainability | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Added data export function<br>4. Updated component style according to UI design draft<br>5. Optimized user login process |
| Optimizing performance | 1. Modified the default settings in the configuration file<br>2. Fixed the data inconsistency problem caused by multi-threaded concurrency<br>3. Added GraphQL query cache mechanism<br>4. Updated Webpack configuration to improve build speed<br>5. perf: Optimized database query performance | 1. perf: Optimized database query performance<br>2. Optimized the rendering performance of React components<br>3. Optimized the user login process<br>4. Fixed the problem of slow home page loading<br>5. Optimized the execution efficiency of MongoDB aggregation query | 1. Optimized the user login process<br>2. perf: Optimized the database query performance<br>3. Optimized the rendering performance of React components<br>4. Fixed the problem of slow loading of the home page<br>5. Restructured the code structure to improve maintainability | 1. Fixed the slow loading speed of the home page<br>2. Optimized the user login process<br>3. perf: optimized the database query performance<br>4. Fixed the crash problem reported by users<br>5. Deleted the abandoned API calls | 1. perf: Optimized database query performance<br>2. Optimized user login process<br>3. Optimized the rendering performance of React components<br>4. Updated Webpack configuration to improve build speed<br>5. Optimized the execution efficiency of MongoDB aggregation queries |

Let's ask Gemini 2.5 Pro to analyze it:

Comprehensive evaluation:

Based on the above analysis of all queries, the recall accuracy of each model is as follows:

  1. Baichuan: The most outstanding and stable performance. In most queries, especially those involving specific technologies (React, databases, UI) and development processes (refactoring, post-review changes, conflict resolution), its recalled results were the most relevant, and its top 5 contained the most strongly relevant content. It seems to understand mixed Chinese-English text and professional terminology best.
  2. Baidu: Overall performance is also very good, just behind Baichuan. It performs well in queries such as API development, UI adjustment, and bug fixes. However, its performance in some individual queries (such as database optimization) is subpar, which may be due to deviations in some semantic understanding or over-reliance on keywords.
  3. Qwen: The performance is above average. It performs well on some queries, but when deeper semantic understanding or differentiation of specific technical fields is required, the relevance of the recall results sometimes decreases, and some weakly relevant or generalized results are mixed in.
  4. Hunyuan: Average performance. It performs OK on simple, clear queries (such as fixing a bug), but the recall relevance is often not as good as Baichuan and Baidu on more complex or technical queries.
  5. Doubao: The worst performance. In multiple queries (such as adding new features, updating documents, optimizing performance, and React components), its recall results contain a large amount of irrelevant content, showing that it has obvious deficiencies in understanding query intent and commit message semantics.

Final conclusion: The Baichuan model has the highest recall accuracy and is most suitable for this type of task.

Judgment basis and examples:

  • Baichuan's strengths (examples):

    • Accurate understanding of technical terms and domains: for the query "Database Optimization", all 5 of Baichuan's recalled commits were directly relevant ("perf: Optimized database query performance", "Optimized the execution efficiency of MongoDB aggregation queries", "Add GraphQL query caching mechanism", "Added Elasticsearch index management function", "Fixed Redis connection pool leak issue"), covering performance optimization, caching, indexing, connection pooling, and more. This shows a deep understanding of database terminology and optimization methods.
    • Accurate grasp of UI/component content: for "About React component submission" and "UI interface adjustment", Baichuan accurately recalled highly relevant commits such as "Optimizing rendering performance of React components", "style: Adjusted the style of UI components", "Update component style according to UI design draft", and "fix(ui): modal...", outperforming most other models.
    • Stability: it maintained high recall quality on most queries and rarely recalled large numbers of completely irrelevant results.
  • Weaknesses of the least suitable model, Doubao (examples):

    • Weak semantic understanding, prone to irrelevant recalls: for "Add new features", only 1 of Doubao's top 5 results was strongly relevant ("Added feature flag for new features"); the remaining 4 ("Updated documentation", "Optimize user login process", "Removed deprecated API calls", "Refactored the code structure...") are far from the intent of adding new features.
    • Over-generalization and keyword matching: for "Optimize performance", Doubao did recall three relevant results ("Fixed the issue of slow loading of home page", "Optimize user login process", "perf: Optimized database query performance"), but also "Fixed the crash issue reported by users" and "Removed deprecated API calls"; the latter two have little to do with performance optimization and were probably simple matches on words like "fix" and "optimize".
    • Poor consistency: it recalled irrelevant results across multiple queries, indicating a general difficulty understanding the semantics of texts like commit messages.

Therefore, based on a comprehensive analysis of all query results, the Baichuan model showed the highest recall accuracy and the best semantic understanding in this evaluation, and is the best choice for this task. But listen to both sides before judging: I also had Deepseek-R1 analyze the results. It agreed that Doubao came last in this round (maybe this scenario just isn't its strength?), but where Gemini put Baichuan first, Deepseek-R1 gave first place to Baidu.

Besides OpenAI, I had originally planned to invite foreign embedding models from Google and Anthropic as well. Unfortunately, those either do not support Chinese or I don't have an API key (that one's on me), so they had to sit this out. So I brought out OpenAI's three sisters for evaluation. Here are their results (only an excerpt due to space limits; the full content is in the code repository):

| Query | text-embedding-3-small | text-embedding-3-large | text-embedding-ada-002 |
| --- | --- | --- | --- |
| How to fix the bug | 1. Fixed the bug of the login page<br>2. Fixed the crash problem reported by users<br>3. Modified the code according to the Code Review feedback<br>4. Fixed the regression problem reported by the QA team<br>5. Fixed the slow loading speed of the home page | 1. Fixed the bug of the login page<br>2. Fixed the crash problem reported by users<br>3. Fixed the regression problem reported by the QA team<br>4. Fixed the data inconsistency problem caused by multi-threaded concurrency<br>5. Fixed the Redis connection pool leakage problem | 1. Fixed the crash problem reported by users<br>2. Fixed the regression problem reported by the QA team<br>3. Fixed the bug of the login page<br>4. Fixed the data inconsistency problem caused by multi-threaded concurrency<br>5. Fixed the slow loading speed of the home page |
| Adding new features | 1. Added feature flag for new features<br>2. feat: Added new payment interface<br>3. Merged the latest changes from the develop branch<br>4. Added data export function<br>5. Restructured the code structure to improve maintainability | 1. Added feature flag for new features<br>2. Added data export feature<br>3. feat: Added new payment interface<br>4. Added Elasticsearch index management function<br>5. Added unit test cases | 1. Added feature flag for new features<br>2. Added data export feature<br>3. Added Elasticsearch index management feature<br>4. Added unit test cases<br>5. feat: Added new payment interface |
| Updated documentation | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Merged the latest changes from the develop branch<br>4. chore: Updated package dependencies<br>5. Updated component styles based on UI design drafts | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Added data export function<br>4. chore: Updated package dependencies<br>5. Prepared for v2.0.0 release | 1. Updated documentation<br>2. docs: Updated API documentation<br>3. Modified default settings in configuration files<br>4. chore: Updated package dependencies<br>5. Prepared for v2.0.0 release |
| Optimizing performance | 1. perf: Optimized database query performance<br>2. Optimized the rendering performance of React components<br>3. Optimized the execution efficiency of MongoDB aggregation queries<br>4. Refactored the code structure to improve maintainability<br>5. Optimized the user login process | 1. perf: Optimized database query performance<br>2. Optimized the rendering performance of React components<br>3. Optimized the user login process<br>4. Optimized the execution efficiency of MongoDB aggregation queries<br>5. Restructured the code structure to improve maintainability | 1. Optimized the rendering performance of React components<br>2. perf: Optimized the database query performance<br>3. Optimized the execution efficiency of MongoDB aggregation queries<br>4. Optimized the user login process<br>5. Restructured the code structure to improve maintainability |

Comprehensive evaluation:

  1. text-embedding-3-large: The best performer. It provides the most relevant results for most queries, especially those requiring an understanding of specific technical actions (API development, React component work, refactoring). Although a few of its lower-ranked results are still filler, its core recall relevance and accuracy are the highest of the three. It seems strongest at capturing the terminology and implicit intent in commit messages.
  2. text-embedding-3-small: Performs well and is a strong contender. On many queries it comes very close to large, and on individual queries such as UI adjustment it is sometimes even slightly better. For a "small" model its performance is impressive. Its main weakness is that, compared with large, it mixes in more irrelevant or weakly relevant results.
  3. text-embedding-ada-002: Relatively weak. It does fine on direct queries (such as bug fixing and performance optimization), but on queries that require finer discrimination (such as database optimization and API development) it lags behind large and small and recalls more irrelevant results. It appears more easily swayed by surface keywords and less able to grasp deep semantics than the newer text-embedding-3 series.

The Tencent Yuanbao web version gives this summary:

| Evaluation dimension | text-embedding-3-large | text-embedding-ada-002 | text-embedding-3-small |
| --- | --- | --- | --- |
| Technical recall rate | 92% | 85% | 78% |
| Semantic boundary accuracy | 89% | 76% | 68% |
| Mixed-text processing | 94% | 83% | 72% |
| Process task recall | 86% | 91% | 88% |

Multiple AIs voted unanimously for the text-embedding-3-large model: the highest recall accuracy, and the best fit for this type of task.

This actually differs a bit from my expectation. I had assumed small is cheaper mainly because of its smaller dimensionality (1536 vs. 3072) and that this would not matter in this scenario, yet it trailed large on several recalls. Maybe high dimensionality really does help. The old ada model, meanwhile, is clearly behind here; if you are still on it, perhaps consider upgrading?
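Incidentally, the `text-embedding-3` series lets you request truncated vectors via a `dimensions` parameter, so the effect of dimensionality could be isolated within a single model. A sketch using the official `openai` package (assumes OPENAI_API_KEY is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same model, two output sizes: one way to isolate the dimension effect.
full = client.embeddings.create(model="text-embedding-3-large",
                                input="Optimize performance")
short = client.embeddings.create(model="text-embedding-3-large",
                                 input="Optimize performance",
                                 dimensions=1536)
print(len(full.data[0].embedding), len(short.data[0].embedding))  # 3072 1536
```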

Finally, here are the overall recommendations, domestic and foreign combined (from Gemini 2.5 Pro):

First Tier: Highly Recommended (Overall Best Performance)

  1. text-embedding-3-large:
  • Reason: The strongest overall performer of all the models. It not only returns highly relevant results on most queries, but also shows a firm grasp of specific technical actions (such as API development details and specific UI fixes like fix(ui)) and of subtle semantic differences. Its recalls usually have the best relevance ranking and accuracy, with the fewest irrelevant results mixed in. The first choice when recall quality matters most.
  • Example strengths: the most comprehensive recall on the API development query; catches the fix(ui) detail on the React component query.

Second Tier: Strong Contenders

  1. Baichuan:
  • Reason: The best performer among the domestic batch, with overall strength very close to text-embedding-3-large. It is especially good at database optimization and UI/component terminology, suggesting it may be well optimized for the Chinese technical domain. For scenarios focused on those areas it may be a choice on par with large.
  • Example strengths: the most comprehensive recall on the database optimization query; excellent on UI/component queries.
  2. text-embedding-3-small:
  • Reason: As the smaller sibling of large, it beats ada-002 and most of the domestic batch. On most queries it follows right behind large and Baichuan with high recall relevance. Given its likely better cost-effectiveness and speed, small is an attractive option when absolute top performance matters slightly less or cost is a major factor.
  • Example strengths: stable overall performance, highly similar to large, and possibly more economical.

Third Tier: Good / Acceptable

  1. Baidu:
  • Reason: Second best among the domestic batch, with good overall recall, especially on queries such as API development and bug fixes. Its weakness is stability: slightly behind Baichuan, with poor results on some queries (such as database optimization), perhaps due to gaps in semantic understanding or over-reliance on keywords.
  • Example strengths: good recall on API development and bug fixes. Weakness: poor on the database optimization query.

Fourth Tier: Fair / Use with Caution

  1. Qwen:
  • Reason: Moderate performance. It handles some queries well, but when finer distinctions or deeper semantic understanding are needed, its accuracy falls short of the top three tiers and it more readily recalls weakly relevant or generic results.
  2. text-embedding-ada-002:
  • Reason: As OpenAI's previous generation, it is clearly outperformed by the text-embedding-3 series. It is fine on simple queries, but on complex ones, or queries that require distinguishing technical domains (such as database optimization and API development), it falls behind and recalls mixed-quality results. Unless there is a specific reason (such as compatibility), it should not be the first choice.

Fifth Tier: Not Recommended

  1. Hunyuan:
  • Reason: Mediocre overall; works well only on the most straightforward queries. Recall accuracy is low on complex or specialized queries.
  2. Doubao:
  • Reason: The worst performer in this evaluation, returning large numbers of completely irrelevant results on many queries, showing clear difficulty with the semantics of commit messages.

Summary suggestions:

  • Best results: prioritize **text-embedding-3-large**.
  • Balance of quality and cost: text-embedding-3-small is an excellent choice, with near-top performance and likely better economy.
  • Domain-specific strengths: if your scenario concentrates on databases, infrastructure, or Chinese UI components, **Baichuan** is well worth testing; it showed real expertise in those areas.
  • Fallback: if the first three are unavailable, Baidu is an option, but watch its stability.
  • Avoid: avoid Hunyuan and Doubao for tasks that demand high semantic-understanding accuracy; text-embedding-ada-002 should also be treated as outdated.

I feel a little sorry for Hunyuan and Doubao in this scenario. It's not that you're bad; maybe we're just not a good match :P

Postscript

Because there is a lot of test data, the article runs long. As for how to find a suitable embedding model: leaderboards can show how good a model is, but they may be biased, so I recommend testing and verifying against your own application scenario and data. As you can see, my test targets one special scenario; you may need to weigh several factors:

  • Model dimensionality: different dimensionalities can express different amounts of semantics. I've heard higher is better, but it depends on the scope of your data.
  • Supported languages: many models simply don't support the language you need, which naturally eliminates them.
  • Data security: on this consideration, you may need to self-host an open-source model.

Although Google did not join today's competition, it reportedly has a special capability: you can specify a task type when embedding, which may make retrieval more accurate. For details, see: Google Embedding Task Types [4]
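For reference, here is a sketch of what that looks like with the `google-generativeai` package; the model name and task-type values follow Google's documentation at the link above, but treat them as subject to change:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Documents and queries get different task-type hints, which can
# sharpen retrieval compared with a single generic embedding.
doc = genai.embed_content(model="models/text-embedding-004",
                          content="fix: Fixed the bug of login page",
                          task_type="RETRIEVAL_DOCUMENT")
query = genai.embed_content(model="models/text-embedding-004",
                            content="How to fix a bug",
                            task_type="RETRIEVAL_QUERY")
print(len(doc["embedding"]), len(query["embedding"]))
```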

    Which one will I choose in the end? You guess.