What can we learn from interpreting DeepSeek's 9 papers? (Part 1)

An in-depth look at DeepSeek's research innovations and development process.
Core content:
1. An interpretation of DeepSeek's public papers and their innovations
2. DeepSeek's two research threads: foundation models and reasoning ability
3. DeepSeek's emphasis on experiments and data, and its willingness to explore new architectures
I believe everyone has seen plenty of commentary on DeepSeek since the start of the year, from the initial shock overseas and the surge in users to the framing of it as a contest of national destiny. There have been in-depth analyses from many angles, but in the end, apart from the Liang Wenfeng interview in "Undercurrent" that I read last July, there are very few first-party sources from DeepSeek, and most coverage still feels a bit superficial. Just yesterday, I came across a podcast of more than three hours on "Zhang Xiaojun Jùn | Business Interviews", in which Ho Chun-yin, Assistant Professor in the Department of Computer Science at the Hong Kong University of Science and Technology, explains "DeepSeek's 9 key papers and innovations one by one: The Game of the Brave" (his earlier interpretation of the R1 and Kimi 1.5 papers was also excellent).
Come to think of it, aren't the papers themselves the most first-hand source? And it is not just the R1 paper, but all of DeepSeek's public papers since early 2024. There may be a lot hidden in them that media commentators and AI KOLs have not yet dug out, content that can genuinely show how DeepSeek reached its current achievements over the past year or more. So I listened carefully and, with the help of Tongyi Tingwu's podcast transcription and GPT-o1 for polishing the terminology, made a text version to share with everyone.
The best tribute is learning, and learning makes people happy! So much for the preamble.
From dense models to mixture of experts, and then to reasoning
Looking back at the core papers published by DeepSeek over the past year, we can roughly divide its research into two main threads:
- Foundation models: evolving from the earliest dense architecture to MoE (Mixture of Experts) models, while continuously inventing and adopting new, more efficient training algorithms along the way.
- Reasoning ability: covering math problem solving, code generation, logical question answering, and even theorem proving, emphasizing the "depth of thinking" of large models and continuously innovating in how reinforcement learning is done.
Before reading this paper-by-paper analysis, keep in mind several key traits of DeepSeek: the company places great emphasis on experiments and data, is adventurous enough to try new architectures and algorithms, and is genuinely willing to share internal research details and give the community reproducible technical reports.
Paper 1: January 2024
"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism"
To briefly describe the positioning of this work: DeepSeek's first paper did not contain much innovation, because it was essentially a reproduction of Llama 2. Llama 2 had just come out, and as a startup DeepSeek took a very reasonable approach at the beginning: first reproduce Llama 2's performance, then make further improvements on top of it. So most of this work follows Llama 2.
Of course, there are differences in the data. For example, DeepSeek built a bilingual Chinese-English model, and the quality of its data preparation may be higher. But the model architecture and the training methods are essentially the same as Llama 2's. The model comes in two main sizes, a small 7B and a large 67B; the Llama 2 baselines are 7B and 70B. They trained on 2T tokens of data (in the same vein as Llama), and then did post-training such as SFT and DPO. Their experimental results show that it ultimately surpassed Llama 2 70B, which is to be expected: Llama 2 was released first, so follow-up work could generally improve the data quality, and at the time many domestic models claimed to beat Llama 2 70B. The overall significance of this paper lies in the reproduction of Llama 2 and in the rigorous scientific attitude DeepSeek showed. I will highlight several places where that rigor shows:
For example, they chose a multi-step approach for learning rate scheduling. Large model training usually uses cosine annealing, gradually reducing the learning rate along a cosine curve. But this has a problem: you have to specify in advance how many tokens you plan to train on in order to lay out the whole curve. If the amount of data changes during training, for instance new data is prepared and added midway, the original cosine curve is hard to adjust. DeepSeek therefore adopted a multi-step learning rate schedule: keep the learning rate constant at first, step it down after training a certain fraction of the tokens, and then hold the new constant value. Although they found the final performance did not differ much from cosine annealing, this method is more flexible.
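To make the difference concrete, here is a minimal sketch (my own illustration, not DeepSeek's configuration) of the two schedules as plain Python functions; the peak learning rate, the step points, and the drop ratios are assumptions chosen to roughly match the description above.

```python
import math

PEAK_LR = 4.2e-4                      # hypothetical peak learning rate
TOTAL_TOKENS = 2_000_000_000_000      # planned budget: 2T tokens, as in the paper

def cosine_lr(tokens_seen: int) -> float:
    """Cosine annealing: the whole curve depends on TOTAL_TOKENS being fixed up front."""
    progress = min(tokens_seen / TOTAL_TOKENS, 1.0)
    return 0.1 * PEAK_LR + 0.9 * PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))

def multistep_lr(tokens_seen: int) -> float:
    """Multi-step: constant plateaus with discrete drops; if the token budget grows
    mid-run, only the later drop points move and the early plateau is unaffected."""
    progress = tokens_seen / TOTAL_TOKENS
    if progress < 0.8:
        return PEAK_LR
    if progress < 0.9:
        return PEAK_LR * 0.316        # illustrative drop ratio
    return PEAK_LR * 0.1

for t in (0.5e12, 1.7e12, 1.95e12):
    print(f"{t:.2e} tokens  cosine={cosine_lr(int(t)):.2e}  multistep={multistep_lr(int(t)):.2e}")
```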
The second point where they differ from Llama 2: they did a very careful study of scaling laws. A scaling law, roughly, tells you how to predict the optimal configuration of model size, data size, and hyperparameters in advance when the training budget (compute) is fixed. Because large-model experiments are extremely expensive, you need a method you can extrapolate from in order to avoid repeated full-scale runs. DeepSeek's paper studied scaling laws more rigorously. First, they ran dedicated scaling experiments on hyperparameters (such as batch size and learning rate), whereas much previous work had not studied this systematically. Industry practice might simply reuse Llama 2's settings, but DeepSeek treated it as a scientific question and ran many careful experiments, much like a university lab would. They also challenged some of the earlier, rather rough compute estimates, for example by including the computational overhead of attention in the estimate. They proposed a new formula that differs slightly from the previous ones in its terms, and the difference matters: the prediction of the optimal configuration becomes more precise when you extrapolate in practice. You can see here that DeepSeek does not simply copy previous methods; like researchers, it experiments carefully and examines the scientific logic behind them. This matters a lot for large models, because you cannot afford many full-scale attempts and can only extrapolate after testing on small models.
They also emphasized the impact of data quality on the scaling law: if the data quality is higher, the optimal split between model scale and data scale changes. Much previous work knew that data quality matters, but it was unclear exactly how it affects the optimal configuration. DeepSeek took these factors into account, which is very rigorous.
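As a concrete illustration of what "extrapolate after testing on small models" means in practice, here is a minimal sketch of the generic recipe behind such studies: run small-scale experiments, fit a power law in log space, and extrapolate to a large budget. The data points, the target budget, and the single-variable fit are all invented for illustration; this is not DeepSeek's formula or data.

```python
import numpy as np

# (compute budget C in FLOPs, empirically best model scale M at that budget) -- made up
small_scale_runs = [
    (1e17, 4.0e7),
    (3e17, 7.5e7),
    (1e18, 1.5e8),
    (3e18, 2.8e8),
    (1e19, 5.5e8),
]

C = np.array([c for c, _ in small_scale_runs])
M = np.array([m for _, m in small_scale_runs])

# Fit log M = alpha * log C + log a, i.e. a power law M_opt = a * C^alpha
alpha, log_a = np.polyfit(np.log(C), np.log(M), deg=1)
a = np.exp(log_a)

target_budget = 3e23                       # a hypothetical large-scale training budget
predicted_scale = a * target_budget ** alpha
print(f"fitted exponent alpha = {alpha:.3f}")
print(f"predicted compute-optimal scale at C={target_budget:.0e}: {predicted_scale:.2e}")
```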
Finally, the most impressive part of this paper: they discussed the phenomenon of leaderboard gaming very frankly. In 2023, many large models were gaming the Chinese evaluation benchmark C-Eval. Because C-Eval is mostly multiple-choice questions (pick one of four), a model that is deliberately trained hard on multiple-choice questions can score very high on the leaderboard even though its actual generalization ability may not be strong. DeepSeek ran a controlled experiment: the original model scored only 47 points, but after being deliberately trained on multiple-choice style data it instantly reached 71 points, a very obvious gap. What is admirable is that they wrote this leaderboard-gaming experiment truthfully into the paper, whereas many companies would never disclose it; they would only say "my score is higher than so-and-so's" without mentioning the training recipe behind it. At the time we also did a lot of investigation, because C-Eval was maintained and updated by us, and we knew many models were "high score, low ability", getting good-looking numbers by gaming the leaderboard. But DeepSeek exposed this phenomenon in the paper and stressed that their released model did not cheat. I personally respect this honesty, and you can see that the company leans toward an academic research style, pursuing an understanding of the underlying science rather than just publicity. Another domestic company, Kunlun Tiangong, made a similar disclosure, which was very rare at the time, because for some companies a good leaderboard position matters more than honesty. DeepSeek is more like an academic team, presenting its research in a scientific way.
In summary, the DeepSeek LLM paper itself does not introduce many model innovations over Llama 2, but it rigorously analyzes the principles behind the recipe and fills in points that had not been examined in depth before, such as the extra compute overhead of attention and the relationship between data quality and optimal configuration. They also candidly demonstrated in the paper how strongly leaderboard gaming can distort evaluation results, and stated up front that their model did not game the leaderboard. This style was very rare in the domestic environment at the time, and it made people look at the company differently.
Paper 2: January 2024
"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models"
From the second paper on, DeepSeek actually switched to mixture-of-experts models, or MoE; its first model was a dense model, since Llama has always been dense. MoE means the feed-forward network inside the transformer is split into several parts, each of which is an expert, hence "mixture of experts". When data comes in to be predicted, it does not need to pass through all the experts. For example, one expert may be good at mathematics, another at physics, another at literature; for a math problem, only the first expert is needed and the others sit idle. A mixture-of-experts model is also called a sparse model: even if you have N experts, an incoming math problem may use only one of them while the rest play no role, hence "sparse". A dense model, by contrast, uses all of its parameters no matter what you feed it, hence "dense". That is the intuitive way to understand it.
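To make the sparse-activation idea concrete, here is a minimal sketch of a top-k routed MoE layer. The dimensions, the softmax-then-top-k gating, and the per-expert loop are my own simplifications for illustration, not the design from any DeepSeek paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():                              # only those tokens run this expert
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(10, 64)
print(TinySparseMoE()(x).shape)   # torch.Size([10, 64])
```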
Why do people want mixture of experts at all? DeepSeek was not the first to do MoE; MoE has existed for a long time, and Google already had MoE models before ChatGPT came out. Moreover, before DeepSeek did MoE, there were already widespread rumors that ChatGPT's underlying model was a MoE, which would let them cut inference costs. So it was natural for DeepSeek MoE to follow this path. Why is MoE important? Because MoE is very promising for scaling up: you can make your model very large while keeping inference cost very low. From DeepSeek MoE to V2 and V3, they are all MoE models, including R1.
This paper is not a polished model product; it differs from the paper just discussed, where they seriously built a model and released it. This one is more of a study: they wrote up some early algorithmic strategies and experimental results and published them. In the paper they experimented with a 2B model and eventually produced a 16B model, but a 16B MoE is actually a very small MoE. Here I mainly want to talk about what they did differently. The innovations of this paper are two points:
The first is that they used many more experts. How do people usually do MoE? Usually with a small number of experts, 8 or 16. This paper wanted to subdivide further: splitting into 8 or 16 experts is still too coarse, so they split into 64 or even 128 experts. What is the benefit? Finer granularity. With 128 experts, you might select just a couple of them for each token, because the split is so fine. What problem does this solve? When people used fewer experts, say 8 or 16, the distinctions between experts were not obvious: with so few experts, two or three of them end up having to learn a lot of knowledge that is shared across everything, so they do not differentiate. With 128 experts, each expert can learn something genuinely different. This setting is where the paper differs sharply from previous work. After it came out, it sparked a discussion about whether MoE should be done the DeepSeek way, with many experts rather than just 8 or 16 as before. I think this is a genuinely innovative aspect of DeepSeek.
The second is that, in addition to the routed experts, they have shared experts. The thinking is: the 128 routed experts should differ from one another, but basic abilities such as language understanding and common sense are shared across all queries, so there should also be some shared experts; the model has both specialized and general components. The design is quite intuitive, but what I want to emphasize is that thinking of it is not that hard, while actually trying it is brave, because exploring it at scale takes a lot of compute. After all, others had already done MoE with 8 experts and it worked well enough; copying that is the simplest and lowest-risk route. Why do something different? DeepSeek makes many such attempts in the papers discussed later today, and I think they are all very brave. For a company, that is rare in the industry.
How many experts did they use here? 64 (the later V2 and V3 papers used more). Looking at the configuration, each layer has two shared experts and 64 routed experts. The models are not very large, just 2B and 16B, and because the 16B model is a MoE, its activated parameters are only 2.8B. You can see that the 2.8B of activated compute gives results not much worse than before. To sum it up in one sentence: with only about 40% of the computation, DeepSeekMoE matched the previous level of performance. What does 40% mean? If you deploy this model, your cost drops to roughly 40%. This was different from what others had done; they pushed ahead, found that it worked well, and I think that gave them the confidence to do V2 later.
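A rough back-of-the-envelope sketch of why fine-grained plus shared experts grows total capacity while keeping activated compute small. The per-expert parameter counts and the number of routed experts activated per token are my own assumptions; only the "2 shared + 64 routed" split follows the text above, and the numbers are not meant to reproduce the paper's exact 2.8B/16B figures.

```python
def moe_params(n_shared, n_routed, n_active_routed, expert_params):
    """Total vs activated parameters for one MoE layer stack (illustrative arithmetic)."""
    total = (n_shared + n_routed) * expert_params
    active = (n_shared + n_active_routed) * expert_params
    return total, active

# Conventional coarse MoE: 8 big experts, top-2 routing (assumed sizes)
coarse = moe_params(n_shared=0, n_routed=8, n_active_routed=2, expert_params=1.6e9)

# Fine-grained MoE in the style described above: 2 shared + 64 small routed experts,
# with an assumed 6 routed experts activated per token
fine = moe_params(n_shared=2, n_routed=64, n_active_routed=6, expert_params=0.2e9)

for name, (total, active) in [("coarse", coarse), ("fine-grained", fine)]:
    print(f"{name:>12}: total={total/1e9:.1f}B  activated={active/1e9:.1f}B  "
          f"activated fraction={active/total:.0%}")
```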
In one sentence, this paper says: we designed the MoE experts this way and verified that it works well at a moderate scale.
Paper 3: May 2024
"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"
Then comes V2, where they verified DeepSeek MoE at a genuinely large scale; following that base, it naturally became DeepSeek-V2. DeepSeek-V2 is basically a scale-up of the DeepSeekMoE paper just discussed: they first ran verification experiments at a relatively small scale in the DeepSeekMoE work, did a scientific and rigorous study, and only then dared to commit such a large-scale investment. V2 increased the number of experts to 160. Previously, people used 8 or 16 experts when building large models, and here it suddenly has ten times as many. But this was not achieved overnight; as with the previous paper, it was reached step by step.
Let's first look at V2's configuration in the paper. It is a 236B MoE model, a large scale, yet its activated parameters are only 21B, and its context length is strong, supporting 128K. The main comparison is against DeepSeek 67B, the 67B from the first paper, and V2 performs better. Look at the generation throughput, i.e. how many tokens can be generated per second: it is 5.76 times faster than before. What does this mean? The model has over 200B parameters, nearly four times larger than the first-generation 67B, yet its training cost is more than 40% lower, its generation speed is over five times faster, and its performance is better. So you can see that DeepSeek started from here, and I think their idea of cost economy runs through all of their papers.
This paper contains something very important: Multi-Head Latent Attention (MLA), which has recently attracted a lot of attention abroad, because it really was first proposed by DeepSeek itself, not by anyone else. Let me try to explain it briefly. In a transformer, multi-head attention means there are many heads, and each head has its own query, key, and value; attention is computed over this (query, key, value) triple. When ChatGPT or DeepSeek generates an answer, it produces tokens one by one, and every new token must attend to every token before it. For example, if you feed it two articles totaling 5,000 words and ask it to write a new article that may also run to 5,000 words, then each newly generated word must attend to each of the preceding words, and every head must do this.
How is this implemented? The preceding context is called the history, and keys and values are computed for every word in it. You certainly do not want to recompute them: you cannot recompute the keys and values of the whole history every time you generate a word, because they depend only on the preceding tokens and have nothing to do with the token being generated now. So the usual approach is to store the keys and values of, say, the 5,000 preceding words; then, for each new word, you only need to compute its query to do attention, which saves time. But saving time costs space, trading space for time. This is where the KV cache appears: it refers to how much GPU memory is occupied by the stored keys and values, and it becomes a real cost at deployment time.
To shrink the KV cache, there are existing techniques such as grouped-query attention (GQA), where several heads share one set of keys and values; DeepSeek's first model used GQA. A more radical option is multi-query attention (MQA), where all heads share a single key and value, so the KV cache becomes much smaller. This is of course very aggressive, and there is a trade-off: going from MHA to GQA to MQA, the KV cache gets smaller and smaller, but performance also degrades. You become more cost-efficient, but quality drops, so a balance is needed.
DeepSeek then came up with multi-head latent attention. The keys and values are represented by what the paper calls a compressed latent KV, a low-dimensional vector; the actual keys and values do not need to be stored explicitly but are mapped back from this low-dimensional latent to the high-dimensional space. This is a low-rank method: in linear algebra terms, if the original vector is 1,024-dimensional, which is high-dimensional and costly, the low-rank latent might be only on the order of 100 dimensions; you work with the compressed representation and use a matrix to map it back up. This is the so-called low-rank key-value joint compression. What is the benefit? When you deploy, you need to store a lot of keys and values. Previous methods simply stored fewer of them; DeepSeek's method keeps all the heads but stores less per token: it stores only the compressed latent KV, from which keys and values can be reconstructed. If keys and values were 1,024-dimensional and the latent is around 100-dimensional, the storage shrinks by roughly a factor of ten, the KV cache becomes very small, and deployment needs far less memory. In this paper, the KV cache is reduced by 93%.
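Here is a minimal sketch of the low-rank joint KV compression idea: cache only a small latent vector per token and reconstruct full-dimension keys and values from it when attention is computed. It is my own simplification; it omits the decoupled RoPE handling and the query-side compression of the actual MLA design, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 128   # illustrative sizes

class LowRankKVCache(nn.Module):
    def __init__(self):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct V

    def compress(self, h):        # h: (seq, d_model) -> only (seq, d_latent) is cached
        return self.down_kv(h)

    def expand(self, latent):     # rebuild per-head keys/values from the latent
        k = self.up_k(latent).view(-1, n_heads, d_head)
        v = self.up_v(latent).view(-1, n_heads, d_head)
        return k, v

mla = LowRankKVCache()
history = torch.randn(5000, d_model)      # 5,000 previously generated tokens
cache = mla.compress(history)             # what actually sits in GPU memory
k, v = mla.expand(cache)                  # rebuilt on the fly for attention

full_cache_floats = 5000 * 2 * n_heads * d_head    # storing K and V for every head
latent_cache_floats = cache.numel()
print(f"per-layer cache entries: full KV={full_cache_floats:,}  latent={latent_cache_floats:,}")
```

In practice the up-projection matrices can often be absorbed into the attention computation so the full-size keys and values never need to be materialized; the sketch keeps them explicit for clarity.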
I think this invention of DeepSeek's was also driven by cost considerations. They used GQA before and probably wanted to cut costs further, but judged that going straight to MQA would hurt performance too much, so they looked for a compromise and designed this method. MLA also requires some special handling of RoPE (Rotary Position Embedding), and they have their own way of dealing with it, which I won't go into here.
Of course, let me mention a few technical details on the MoE side. It needs to balance the different experts: you don't want all the training data to rely on just two experts while the others never get used. To maximize efficiency, they also balance the load across GPUs, and even balance the communication between different GPUs and devices. They want the experts to be used in a relatively balanced way, the GPUs to be utilized in a balanced way, and the communication between GPUs to be balanced as well, because only then can efficiency be maximized. This is another place where I think DeepSeek does a good job. These things may not sound very innovative; they are more like engineering experience and optimization. But they played a big role in DeepSeek's cost control all the way through to V3.
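For reference, here is a minimal sketch of the classic auxiliary-loss style of expert balancing, in the spirit of earlier MoE work; V2's exact losses, which also cover device-level and communication balance, are not reproduced here, and the formula and constants are illustrative.

```python
import torch
import torch.nn.functional as F

def aux_balance_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Penalize the router when the share of tokens per expert drifts from uniform.
    router_logits: (num_tokens, num_experts)."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    _, top_idx = probs.topk(top_k, dim=-1)
    dispatch = F.one_hot(top_idx, num_experts).float().sum(dim=1)   # (tokens, experts), 0/1
    f = dispatch.sum(dim=0) / (num_tokens * top_k)   # fraction of routed slots per expert
    p = probs.mean(dim=0)                            # mean routing probability per expert
    return num_experts * torch.sum(f * p)            # ~1 when balanced, grows when skewed

logits = torch.randn(1024, 64, requires_grad=True)
loss = aux_balance_loss(logits, top_k=6)
print(float(loss))   # roughly 1.0 for near-uniform routing
```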
I still think this was a very brave innovation, because the training is very expensive: this is a 236B model trained on 8.1T tokens. Using a newly invented attention mechanism for the first time at such a scale is very rare. Among companies of the same period, we don't know what the closed models look like, but among the public ones basically no one made such a big change.
The rest, the long-context extension, uses fairly common techniques that others can also apply; in the end it reaches a 128K context, basically with existing technology, so I won't explain it in detail here. They used an extension of RoPE, the rotary position embedding technique; that work is itself very famous, and many people build on it.
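Since RoPE comes up repeatedly in these papers, here is a minimal sketch of the basic rotary position embedding operation: each pair of dimensions of a query or key is rotated by an angle proportional to the token's position, so relative positions show up in the dot product. Context-extension methods rescale these angles, but that rescaling (and DeepSeek's specific recipe) is omitted here; dimensions are illustrative.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, dim) with even dim; returns the position-rotated vectors."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # per-pair frequency
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # rotate each (x1_i, x2_i) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)
print(apply_rope(q).shape)   # torch.Size([8, 64])
```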
You can see the effect. For example, they compare against Mixtral 8x22B from Mistral, which has 8 experts and a total of more than 100B parameters. When Mixtral's weights were open-sourced, people found a problem: the 8 experts were not really differentiated, so it did not achieve the original intention of MoE. DeepSeek V2, by designing so many experts, makes them more differentiated and more fine-grained. And Mixtral and DeepSeek V2 were built at essentially the same time; you can see that even a very leading company abroad did not have the courage to make such a big change, while the contemporaneous DeepSeek-V2 has 160 experts. So although both are MoE models, they are actually very different. Of course, DeepSeek V2's total model is also larger, about 90B larger than Mixtral's, but its activated parameters are still fewer, so its inference and deployment cost is even lower than Mixtral's. V2's results are not far from Llama 3's, though Llama 3 seems to have been trained on more tokens. The summary is this sentence: with only 21B activated parameters, DeepSeek V2 achieves top-tier performance among open-source models.
Then the paper shows how the experts specialize: different experts end up responsible for different domains, and this is learned automatically, not assigned by hand. The paper does not say much about the specific domains. Then the training cost: on an H800 cluster, each trillion tokens takes 30 GPU days, and although the model is over 200B, its training cost is 42.5% lower than their previous 67B model, while inference is 5.76 times faster than the 67B. They also made additional optimizations, such as converting the parameters to FP8 precision at deployment, which lowers the numerical precision; the KV cache just mentioned is also quantized, again lowering its precision, before deployment.
This is why DeepSeek V2 caused such a stir in China at the time. If I remember correctly, it was from DeepSeek V2 that a price war over domestic large-model APIs began, and this helps us understand why they could afford it and what the underlying principle is. But my feeling at the time was that DeepSeek itself might not have cared that much about the product. I always felt that DeepSeek's models, especially the early ones, did not go through very fine-grained RLHF in post-training, nor careful leaderboard-oriented tuning; they did it very much like a research project. They deployed it a bit, since their deployment cost was very low, and they probably did not care much that their price was lower than everyone else's. That is the feeling I got. So they stayed relatively low-key; even after deployment I don't recall them doing any publicity. Later, because their prices were so low, some media started to report on it and they began to get attention.
To summarize: DeepSeek-V2 is a scale-up of DeepSeekMoE. It is backed by a lot of scientific, rigorous study and introduces original methods such as MLA. With 236B parameters and 160 experts it achieved significant gains in LLM efficiency and performance, and at the same time its extremely low cost triggered the start of the industry's large-model price war.
Paper 4: December 2024
"DeepSeek-V3 Technical Report"
The last paper on the foundation side is the most recent one, DeepSeek-V3, published in December 2024. This work attracted a great deal of attention. Looking at the basics in the abstract: the model is very large, 671 billion parameters (671B), and it is also the base model of the later, very famous R1. It is almost three times larger than V2, which is just over 200 billion. Most of it continues the ideas of V2, such as MLA and the DeepSeekMoE strategy, but at a much larger scale. The abstract also mentions the cost, which drew everyone's attention because it seemed so low. The last point is very impressive: they especially emphasized in the abstract that training was completed in one go, very stable, with no loss spikes during the entire run. To explain simply, a loss spike is when the loss suddenly hits an extreme peak during training, for example it suddenly becomes very large, or there is a training anomaly. This used to be very common, because large-scale pre-training runs hit all kinds of problems, GPU or machine failures, or unknown causes that make the loss spike, forcing you to stop, roll back, and restart. But they said this training was very stable and was completed smoothly in one go without any rollback. I think this is remarkable, and behind it there must be very good engineering optimization and team support. So the DeepSeek-V3 paper devotes a lot of space to its engineering implementation, which differs from the style of the previous papers, which did not spend so much space on engineering details.
This paper attracted so much attention firstly because of how little it cost. They used only about 2,000 H800 cards to train DeepSeek-V3, a model with more than 600 billion parameters. As you know, many companies at home and abroad have more than 10,000 cards; abroad this can reach hundreds of thousands of H100s or even better cards, and the card counts of some large domestic players are also far beyond this. In DeepSeek's early days, a cluster of 5,000 or 10,000 H100-class cards was considered quite large in China, but it later turned out that DeepSeek's card count was not particularly large compared with many companies. In the paper they state very frankly that they used about 2,000 H800s and that the final training run cost roughly 5.57 million US dollars. This was shocking at the time, because people compared it with, for example, Llama 3.1 at the 400-billion (405B) scale, whose training was estimated to have cost around 30 million US dollars, nearly six times as much. I think it was from V3 onward that more and more people started to take seriously the techniques DeepSeek had introduced in DeepSeekMoE and DeepSeek-V2, such as MLA (Multi-Head Latent Attention). Before that, especially among foreign developers and the major large-model teams, these were not regarded as mainstream. Only when V3 appeared, and everyone saw that it was both strong and cheap, did people go back and look at DeepSeek's earlier work. Over the past year they moved step by step from DeepSeekMoE to V2 and then to V3; it was not a sudden achievement, because V3 is a direct extension of V2.
The next part mainly covers differences between V3 and V2. One difference is a new approach to load balancing; this part is not the core, but they describe how to balance across many cards and many experts. Previous balancing usually added an auxiliary loss term during training, but this time they used "auxiliary-loss-free balancing": keep a per-expert bias and monitor whether an expert is being overused; if it is used too often, adjust the bias so that expert gets selected less in subsequent steps. It is a very intuitive heuristic that does not require an explicit extra training loss. This is a relatively minor point.
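Here is a minimal sketch of how I read the auxiliary-loss-free balancing idea: a per-expert bias influences only which experts get selected, and it is nudged after each batch against the observed load. The update rule and step size are assumptions for illustration, not V3's exact algorithm.

```python
import torch

num_experts, top_k, update_speed = 16, 2, 1e-2
bias = torch.zeros(num_experts)           # persistent per-expert bias

def route(scores: torch.Tensor) -> torch.Tensor:
    """scores: (tokens, experts) affinity scores; the bias only shifts *which*
    experts are picked, it is not used as a gating weight."""
    _, idx = (scores + bias).topk(top_k, dim=-1)
    return idx

def update_bias(idx: torch.Tensor) -> None:
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    target = load.mean()
    # overloaded experts get a lower bias, underloaded ones a higher bias
    bias.add_(update_speed * torch.sign(target - load))

for _ in range(100):                      # simulate a few routing steps with skewed scores
    scores = torch.randn(512, num_experts) + torch.linspace(0, 1, num_experts)
    update_bias(route(scores))

print(bias)   # experts favored by the skewed scores end up with negative bias
```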
The main point is here: multi-token prediction (MTP). This is the first time it appears in DeepSeek's work. It is not original to the V3 team itself; it builds on a recent paper and uses a multi-token prediction loss. What does this loss do? We all know a language model predicts the next token. With multi-token prediction, during training the model not only predicts the next token but also predicts several subsequent tokens at the same time. The advantage is a richer training signal: instead of predicting just one token, the model also learns to anticipate content further ahead. They also mention in the paper that the model needs to do some planning when generating in order to predict future tokens well. It sounds intuitive, but trying it at large scale is risky: small-scale experiments may validate it, yet large-scale training can become unstable or produce unexpected results and ruin the whole run. But DeepSeek is very willing to try new things; they presumably judged the direction promising, so they really did it at scale. Ultimately this also ties back to DeepSeek's culture from the beginning. Nobody had really used this technique at large scale before, and DeepSeek-V2 did not use it either; they could simply have followed V2, so why add something new? I think this is another distinctive trait of DeepSeek: even after proving that V2 was very low-cost, and even though MLA was already a V2 result, they still wanted to add new attempts in V3, reflecting a team culture of continuous iteration and innovation. MTP has another benefit: since the model learns to predict multiple tokens at a time during training, it can also generate several tokens at a time at inference, not necessarily one by one. They mention that the multi-token prediction module can be reused for speculative decoding to further reduce generation latency. Speculative decoding predicts several tokens at once and then uses a verification step to decide whether to accept them, withdrawing them if not. LLM inference is slow because generation is autoregressive, one token at a time; multi-token prediction can emit several tokens per step instead of waiting for each one. Of course, whether doing this at deployment affects quality needs further study.
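Here is a minimal sketch of a multi-token-prediction style training loss: besides the usual next-token objective, extra heads predict tokens two and three positions ahead. It is my own simplification; V3's MTP module is described as using sequential transformer blocks per extra depth rather than plain linear heads, and all sizes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_future = 1000, 64, 3     # predict 1, 2 and 3 tokens ahead

heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

def mtp_loss(hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """hidden: (seq, d_model) trunk states; tokens: (seq,) target ids."""
    total, seq = 0.0, hidden.shape[0]
    for depth, head in enumerate(heads, start=1):
        logits = head(hidden[: seq - depth])      # position t predicts token t+depth
        targets = tokens[depth:]
        total = total + F.cross_entropy(logits, targets)
    return total / n_future                       # average the per-depth losses

hidden = torch.randn(32, d_model)
tokens = torch.randint(0, vocab, (32,))
print(float(mtp_loss(hidden, tokens)))
```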
Then the compute: 2,048 H800s were used, and H800 rather than H100. Compared with other companies, 2,000-odd H800s is not a lot for this scale. They did extreme engineering optimization to make the training speed and cost very impressive, and a large section of the paper describes the engineering and infrastructure, including communication and low-precision mixed training. Worth highlighting is their use of FP8 low-precision training. Large-scale training usually uses FP16/BF16 or higher-precision floating point, but DeepSeek used FP8 in training, which greatly improves efficiency and reduces cost, at the risk of unstable training or degraded quality, so a lot of engineering experimentation is needed to make it feasible. They point out in the paper that while many teams do low-precision quantization for inference, almost no one had used FP8 in real large-scale training. DeepSeek is one of the first teams to successfully adopt FP8 training at scale, which helped them a great deal on cost. For example, they specifically studied which intermediate quantities must stay in higher precision and which can use FP8, so as to keep training stable. They did very detailed experiments and finally showed that this mixed-precision training works and gives good results. This is another distinctive aspect of DeepSeek: they innovate not only in algorithms but also pay great attention to engineering optimization.
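To show what low-precision storage does to values, here is a minimal sketch that simulates an FP8 (e4m3) round trip with a per-block scale and measures the rounding error. It is only a simulation of the storage format, not DeepSeek's training recipe (which involves fine-grained scaling and higher-precision accumulation inside matrix multiplies), and it assumes a PyTorch version that ships torch.float8_e4m3fn.

```python
import torch

def fp8_roundtrip(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Quantize x to FP8 (e4m3) block by block with a per-block scale, then dequantize."""
    flat = x.flatten()
    res = torch.empty_like(flat)
    for start in range(0, flat.numel(), block):
        chunk = flat[start:start + block]
        scale = chunk.abs().max().clamp(min=1e-12) / 448.0   # 448 ~ max finite e4m3 value
        q = (chunk / scale).to(torch.float8_e4m3fn)          # lossy 8-bit storage format
        res[start:start + block] = q.to(torch.float32) * scale
    return res.view(x.shape)

x = torch.randn(4096)
err = (fp8_roundtrip(x) - x).abs().mean()
print(f"mean absolute round-trip error: {err:.5f}")
```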
Then look at the MoE configuration: each MoE layer has one shared expert and 256 routed experts. This differs from V2, which has two shared experts and 160 routed experts; the model is bigger this time, so the expert count grew to 256. But you can see that DeepSeek has stuck to its original ideas from the first MoE paper through V2 to V3. By contrast, the Llama series, from Llama 1 through Llama 2 to Llama 3, has always used dense models without MoE, which makes Llama 3's training especially expensive. In the paper they also compare against Llama 3.1, the 400B-scale model. Being dense, Llama 3 activates all of its parameters, while DeepSeek-V3 activates only about 37B. In deployment cost, DeepSeek-V3 is more than ten times cheaper than Llama 3, and it surpasses the Llama 3.1 400B base model in English, code, mathematics, and especially Chinese. What's more, the training run cost only about five and a half million US dollars, and deployment is very cheap, so V3 caused a great sensation at the time.
The paper also has ablation experiments, for example verifying whether MTP helps. They ran a controlled comparison on a MoE model at roughly the 20-billion scale, and the results did improve. I believe DeepSeek also verified it at small scale before officially applying it to V3, and only adopted it at large scale after confirming it works. This again shows the care behind their innovations.
Finally, the post-training of V3. The discussion above was mainly about the base model itself, and the subsequent supervised fine-tuning (SFT) and reinforcement learning are not covered at length, but a few things about V3 are worth mentioning. Their SFT data is only about 1.5 million samples (1.5M), which is not much for a model with over 600 billion parameters; I recall Llama 3's post-training data reached the tens of millions, even 20 million. For reasoning data they distilled from DeepSeek R1, which at the time had not yet been published as a paper and was effectively an internal version; they also used DeepSeek-V2.5 (essentially a V2 with further optimization of data and details) to distill data for V3, and then did reinforcement learning (RL). Worth noting: for strongly verifiable tasks such as math and code they used rule-based feedback rather than a reward model, and only used a reward model for open-ended question answering. You could say that by the V3 stage they had already begun leaning toward the rule-based approach instead of relying entirely on a model to make judgments. In the reinforcement learning stage they adopted GRPO, which will come up again in the reasoning section later.
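To illustrate what rule-based feedback for verifiable tasks can look like, here is a minimal sketch with two toy reward functions: exact-match checking of a final math answer and running candidate code against assert-based tests. The answer format, the test harness, and the use of a local python subprocess are my own assumptions, not DeepSeek's verifier.

```python
import re
import subprocess
import tempfile

def math_reward(model_output: str, reference_answer: str) -> float:
    """Reward 1.0 if the boxed/final answer matches the reference exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    lines = model_output.strip().splitlines()
    predicted = match.group(1).strip() if match else (lines[-1].strip() if lines else "")
    return 1.0 if predicted == reference_answer.strip() else 0.0

def code_reward(candidate_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Reward 1.0 if the candidate passes the assert-based tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))                       # 1.0
print(code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))    # 1.0
```

Open-ended answers, where no such rule exists, would instead be scored by a learned reward model, which is what the paper reserves the reward model for.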
Finally, the best version of V3 comes from doing these subsequent steps on top of the base model: the SFT data volume is not large, but some internally distilled data was added, then reinforcement learning, and the results improved significantly. Relatively speaking, they did not spend much effort in the paper on leaderboard numbers. My personal feeling is that the DeepSeek V3 team is not very keen on chasing rankings, at least judging from the paper. In the V1 era DeepSeek did work quite hard on the leaderboards, but by V3 they no longer seem overly concerned with that, and they did not deliberately mix large amounts of evaluation-style data into the base model to inflate scores, which sets them apart from many teams. Judging from the paper's public stance, DeepSeek-V3 gives the impression of focusing on low cost, high efficiency, and engineering innovation, which may also be related to the fact that they are in no hurry to ship a product in the short term.