In-depth article | The mystery of DeepSeek R1's RAG retrieval: Why is the "master of reasoning" not good at embedding?

An in-depth look at the AI field that unpacks why DeepSeek R1 performs poorly at embedding in RAG retrieval.
Core content:
1. The unique value and challenges of the RAG system in AI
2. Analysis of the embedding dilemma of DeepSeek R1 in RAG retrieval
3. Comparison of the performance differences between DeepSeek R1 and other embedding models through authoritative data
In the vast universe of artificial intelligence, the Retrieval-Augmented Generation (RAG) system is becoming a bridge between language models and external knowledge. It not only gives AI a broader knowledge base, but also effectively reduces "hallucinations" and improves the accuracy and reliability of answers. However, building an efficient RAG system is not easy: every link in the chain is like a key clue in a puzzle, tightly connected to the others and indispensable.
DeepSeek R1, a Mixture-of-Experts (MoE) model with 671 billion parameters, shines in fields such as math problem solving and code generation thanks to its powerful reasoning ability. Yet when applied to a RAG system, it exposes an unexpected shortcoming: it is not good at embedding. This raises the question: why does this "master of reasoning" perform so mediocrely in RAG's retrieval stage? What technical logic lies behind this?
Retrieval "shortcomings": Embedding dilemma of DeepSeek R1
To understand DeepSeek R1's embedding dilemma, we need to start with its "genes". DeepSeek R1's training objectives focus mainly on logical reasoning and text generation; it is designed to be a "brain" that is good at thinking and expressing, not a "library" that is good at memorizing and retrieving. This difference in training objectives leaves DeepSeek R1 with inherent deficiencies in mapping text accurately into a semantic vector space.
As a senior AI engineer said: "Different models have different talents. Asking a model that is good at reasoning to do embedding is like asking a sprinter to run a marathon. It is not impossible to complete it, but it is definitely not the best choice."
So, what exactly is the problem with DeepSeek R1's embedding? Data is the best "mirror". Beyond the core task of RAG retrieval, we can also examine DeepSeek R1's performance on other embedding-related tasks to get a more complete picture of its capabilities.
I remember a few months ago, while researching text classification tasks, I stumbled upon an interesting phenomenon: DeepSeek R1's performance on this task was actually worse than that of some specialized embedding models. I was very confused at the time, because DeepSeek R1 had always been an "all-round player" in my mind, with strong reasoning ability and a broad knowledge base. I even began to wonder whether something was wrong with my testing method.
After some in-depth research, I found that this is not an isolated case but a common phenomenon. On text classification, DeepSeek R1's average F1-score is 88.3%, while the specialized embedding model text-embedding-3-large reaches 92.7%. On subtasks such as sentiment analysis and topic classification, DeepSeek R1 also lags behind models such as Qwen2. These figures come from the authoritative MTEB (Massive Text Embedding Benchmark) leaderboard [1], an important reference for evaluating the overall performance of text embedding models.
To show DeepSeek R1's performance across different tasks more clearly, the MTEB leaderboard figures reported in this section can be lined up in a simple comparison:

| Task | DeepSeek R1 (F1-score) | Specialized embedding model (F1-score) |
| --- | --- | --- |
| English text classification | 88.3% | 92.7% (text-embedding-3-large) |
| Swahili news classification | 62.1% | 75.3% (multilingual-MiniLM) |
Even more surprising, when dealing with low-resource languages, DeepSeek R1's embedding capability is stretched even thinner. On a Swahili news classification task, DeepSeek R1's F1-score is only 62.1%, far below multilingual-MiniLM's 75.3%. This suggests that DeepSeek R1 has obvious shortcomings in understanding and handling the nuances of different languages.
All of this data points to one conclusion: DeepSeek R1 is not an all-rounder, and its limitations in embedding cannot be ignored. Perhaps we simply expect too much from this "master of reasoning".
The embedding "talent show": selection criteria for the RAG retriever
Since DeepSeek R1 is not good at embedding, how should we choose the right retriever when building a RAG system? It is like holding an "embedding talent show": we need clear judging criteria in order to pick the "best retriever" for the RAG system.
What is the core requirement of RAG retrieval? Keyword matching? Of course not. The essence of RAG retrieval lies in accurately understanding the semantics of the user's query and deeply mining the relevant documents. The retriever needs to be like an experienced librarian who not only knows each book's title and author, but also understands its content and themes, and can therefore find truly valuable information for the user.
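To make "semantic matching" concrete, here is a minimal sketch of dense retrieval scoring with the open-source sentence-transformers library; the model name, query, and documents are illustrative choices, not values taken from this article.

```python
# Minimal illustration: a query and a document can match semantically
# even when they share almost no keywords.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

query = "How do I reset a forgotten account password?"
docs = [
    "Steps to recover your login credentials if you cannot sign in",  # few shared keywords, same intent
    "Password strength requirements when creating a new account",     # shared keywords, different intent
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity in the embedding space ranks the first document higher,
# which pure keyword matching would likely get wrong.
scores = util.cos_sim(query_emb, doc_embs)
print(scores)
```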
So, how do we evaluate whether an embedding model has this capability? The MTEB (Massive Text Embedding Benchmark) leaderboard is undoubtedly an important reference. As an authoritative benchmark for text embedding models, MTEB gives us a multi-dimensional, quantitative standard through 58 datasets covering 8 task types.
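For readers who want to run such an evaluation themselves, here is a minimal sketch using the open-source mteb package together with sentence-transformers; the model and task names are placeholder choices for illustration only.

```python
# Minimal sketch: score an embedding model on a couple of MTEB tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")  # any embedding model under test

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])  # example tasks
results = evaluation.run(model, output_folder="mteb_results")  # writes per-task scores to disk
print(results)
```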
To better understand the MTEB evaluation system, we can use a Mermaid diagram to show its core task categories:
MTEB adopts a hierarchical evaluation framework, and its core task categories include: semantic similarity, classification tasks, cluster analysis, and retrieval tasks.
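A simple Mermaid sketch of that hierarchy, limited to the four categories named above:

```mermaid
graph TD
    MTEB[MTEB evaluation framework] --> STS[Semantic similarity]
    MTEB --> CLS[Classification tasks]
    MTEB --> CLU[Cluster analysis]
    MTEB --> RET[Retrieval tasks]
```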
Is MTEB perfect, then? Of course not. A major limitation of MTEB is that it cannot fully represent real-world RAG application scenarios. For example, MTEB offers little evaluation of long-text processing, which is crucial when handling long documents in fields such as law and finance. In addition, MTEB's datasets may carry domain bias and cannot fully assess a model's adaptability across fields. As mentioned in this discussion of MTEB's limitations [2], MTEB's evaluation results may deviate from actual application effects.
How can we deal with these limitations of MTEB? One way is to build more targeted evaluation metrics around the actual application scenario. For example, in financial risk control we can focus on the model's recall of financial-report terminology; in the medical field, on its ability to understand medical literature.
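As one possible shape for such a scenario-specific metric, here is a hypothetical sketch of recall@k over a hand-labelled set of (query, relevant documents) pairs; the embed function, the labelled data, and the cutoff k are all assumptions for illustration.

```python
# Hypothetical domain-specific retrieval check: measure recall@k for a candidate
# embedding model on your own labelled corpus. `embed` stands in for any function
# that maps a list of texts to a matrix of embedding vectors.
from typing import Callable, Dict, List
import numpy as np

def recall_at_k(
    embed: Callable[[List[str]], np.ndarray],
    queries: Dict[str, List[int]],   # query text -> indices of relevant docs in `corpus`
    corpus: List[str],
    k: int = 5,
) -> float:
    doc_vecs = embed(corpus)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for query, relevant in queries.items():
        q = embed([query])[0]
        q = q / np.linalg.norm(q)
        top_k = np.argsort(doc_vecs @ q)[::-1][:k]
        # count the query as a hit if any labelled-relevant doc appears in the top k
        hits += any(idx in relevant for idx in top_k)
    return hits / len(queries)
```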
Despite these limitations, MTEB remains an indispensable "barometer" in our embedding "talent show". Through the MTEB leaderboard [3], we can see how different embedding models perform on tasks such as semantic similarity, text classification, and clustering, providing an important basis for selecting the RAG retriever.
On the MTEB leaderboard, the Qwen2 series of models has attracted wide attention for its excellent performance. Qwen2-72B performs well on multilingual retrieval tasks, with an MRR@10 of 0.84 on the XTREME benchmark. Qwen2 also shows strong long-document capability, scoring 93.1 on the RULER long-text evaluation, surpassing GPT-4.
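For readers unfamiliar with the metric, MRR@10 rewards a retriever for placing the first relevant document as close to the top of the ranking as possible. Here is a minimal sketch of the computation with made-up relevance labels purely for illustration.

```python
# Mean reciprocal rank, cut off at rank 10.
from typing import List

def mrr_at_10(ranked_relevance: List[List[bool]]) -> float:
    """ranked_relevance[i][j] is True if the j-th ranked doc for query i is relevant."""
    total = 0.0
    for ranks in ranked_relevance:
        for pos, is_relevant in enumerate(ranks[:10], start=1):
            if is_relevant:
                total += 1.0 / pos   # reciprocal rank of the first relevant hit
                break
    return total / len(ranked_relevance)

# Query 1: first relevant doc at rank 2 -> 0.5; query 2: rank 1 -> 1.0; mean = 0.75
print(mrr_at_10([[False, True, False], [True, False]]))
```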
Of course, the numbers are only a reference; actual application is the ultimate test. To understand more intuitively how different embedding models behave inside a RAG system, we need to run field tests and compare their retrieval quality in real RAG application scenarios.
"Strengthening strengths and avoiding weaknesses": the correct way to open RAG in DeepSeek R1
Since DeepSeek R1 has shortcomings in embedding, does that mean it has no place in a RAG system? Of course not. As a senior AI architect put it: "There is no 'one-size-fits-all' model, only the right screw in the right place."
The biggest advantage of DeepSeek R1 lies in its powerful reasoning and generation capabilities. It is good at extracting key information from multiple search results, performing logical reasoning and knowledge integration, and ultimately generating high-quality, logically rigorous answers. In other words, DeepSeek R1 is an excellent "summarizer" and "thinker" rather than an efficient "searcher."
Therefore, in a RAG system we should put DeepSeek R1 in the position that suits it best: the generation stage. There it can give full play to its chain-of-thought capability, analysing the retrieved results in depth like an experienced expert and producing well-reasoned answers.
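A minimal sketch of that division of labour follows, assuming a dedicated open-source embedding model for retrieval and an OpenAI-compatible endpoint for DeepSeek R1; the endpoint URL, model names, and corpus below are illustrative assumptions rather than values from this article.

```python
# Division of labour: a dedicated embedding model retrieves, DeepSeek R1 only
# reasons over and summarises the retrieved passages.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("BAAI/bge-m3")   # example dedicated embedding model
corpus = ["...your document chunks..."]          # placeholder corpus
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

def answer(question: str, top_k: int = 4) -> str:
    # Step 1: dense retrieval with the embedding model, not with DeepSeek R1
    q_emb = retriever.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    context = "\n\n".join(corpus[h["corpus_id"]] for h in hits)

    # Step 2: hand the retrieved context to DeepSeek R1 for reasoning and generation
    client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # assumed endpoint
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model name for DeepSeek R1
        messages=[
            {"role": "system", "content": "Answer strictly based on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```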
Conclusion: in a RAG system there is no "one-size-fits-all", only "the best partner"
Building an efficient and reliable RAG system is like forming an excellent team. It requires a deep understanding of the characteristics of each member and reasonable division of labor and optimization to achieve the best results. DeepSeek R1 is an excellent "thinker" and Qwen2 is an efficient "searcher". Only by combining them perfectly can a truly powerful RAG system be built.
How will RAG technology evolve? Toward end-to-end training, or fusion with knowledge graphs? Perhaps future RAG systems will be more intelligent and personalized, dynamically adjusting retrieval and generation strategies to user needs. But however the technology develops, a deep understanding of each model's characteristics, together with sensible division of labor and optimization, will remain the key to building an efficient RAG system.
On the SkyPilot blog, Kaiyuan Eric Chen shared his experience building a RAG system with DeepSeek R1 [4] and summarized some practical caveats. His write-up also supports the viewpoint of this article: DeepSeek R1 is good at generation but not at embedding; in a RAG system it should play to its strengths, avoid its weaknesses, and be paired with other models.
In the face of RAG, a complex piece of systems engineering, we must look up at the stars while keeping our feet on the ground: pay attention to breakthroughs in cutting-edge technology, and also to the details of practical application. Only then can we build a RAG system that solves real problems and lets AI better serve human society.
Further reading: RAG system optimization tips
In addition to choosing the right embedding model and generation model, there are many other techniques for optimizing the performance of a RAG system, such as the following:
1. Dynamic chunk optimization: choosing an appropriate chunk size for each task type can improve retrieval efficiency and accuracy.
2. Hybrid retrieval architecture: combining the strengths of sparse and dense retrieval improves performance on long-tail queries (see the sketch after this list).
3. Hardware acceleration: using GPUs or other specialized hardware to speed up embedding computation and model inference reduces response latency.
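As a concrete illustration of the hybrid retrieval idea above, here is a minimal sketch that fuses sparse BM25 scores with dense embedding scores via a weighted sum; the libraries (rank_bm25, sentence-transformers), the toy corpus, and the fusion weight are all illustrative assumptions.

```python
# Hybrid retrieval sketch: weighted-sum fusion of sparse (BM25) and dense scores.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "DeepSeek R1 excels at reasoning and generation",
    "Qwen2 embeddings score well on the MTEB leaderboard",
    "BM25 handles rare keywords and exact matches",
]
model = SentenceTransformer("all-MiniLM-L6-v2")  # example dense encoder

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
doc_emb = model.encode(corpus, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5, top_k: int = 2):
    sparse = np.array(bm25.get_scores(query.lower().split()))
    sparse = sparse / (sparse.max() + 1e-9)              # normalize sparse scores to [0, 1]
    dense = doc_emb @ model.encode(query, normalize_embeddings=True)
    fused = alpha * sparse + (1 - alpha) * dense          # simple weighted-sum fusion
    return [(corpus[i], float(fused[i])) for i in np.argsort(fused)[::-1][:top_k]]

print(hybrid_search("which model is good at MTEB embeddings?"))
```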
The future of RAG technology is full of infinite possibilities, let us wait and see!