What can we learn from the interpretation of DeepSeek's 9 papers (Part 2)

In-depth interpretation of the DeepSeek series of papers to explore the cutting-edge progress of code intelligence.
Core content:
1. The development history and functional features of DeepSeek Coder
2. The latest breakthrough in the field of code intelligence: DeepSeek Coder-V2
3. In-depth analysis of dense models and continued pretraining ("continue pretrain")
Chapter 5 January 2024
"DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence"
Chapter 6 June 2024
"DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence"
DeepSeek's earliest reasoning effort was DeepSeek Coder, which is essentially a dedicated code model. This is very common in the industry: Llama has a code version (Code Llama), and so does Qwen. Basically, every team that builds large models releases a code version, because a code model helps people write code, which makes it a very useful specialist model in its own right, and coding is also part of reasoning.
There is not much to say about DeepSeek Coder itself, because it is a relatively ordinary release. The first version is a dense model, essentially the same architecture as DeepSeek's first-generation LLM and similar to Llama 2. The only difference is its training data: code rather than plain text, basically all code. The sizes range from 1.3B to 33B, and all of them are open source. That was the first version of DeepSeek Coder.

Later they released DeepSeek Coder v1.5. The difference in v1.5 can be summarized in one phrase: "continue pretrain" (they call it "additional pretraining"). Here is what that means: they take an existing general base model, such as DeepSeek LLM 7B, and continue training it on more data, for example code. Ordinary pretraining trains a code model from scratch; continued pretraining starts from an existing checkpoint. That is the difference between the two approaches. For v1.5 they continued training on 2T tokens, and that data was not all code: about 70% was code and the rest was other data. Both versions are open source.

DeepSeek Coder played a big role in DeepSeek's overall rise in popularity, because in the early days, especially abroad, DeepSeek Coder really did have a certain reputation: the code model was very well made, so it was well known among foreign developers. The later MoE models, such as V2, are very large: V2 started at roughly 200B parameters and V3 at roughly 600B. Although we know from the papers that the activated parameters and deployment costs are relatively low, a 200B model is not something ordinary people can deploy, because it requires strong infrastructure. For developers who want to train and serve such a model themselves, it is very difficult, so many people still prefer models of about 7B or 10B. But DeepSeek has not released a 7B or 10B model since the first version, nor any comparable small or medium-sized variant. So although DeepSeek V2 was very good, not many people actually used it, because the model was too large. DeepSeek Coder, on the other hand, has models from 1.3B to 33B, covering small to medium scales, and the quality is good, so DeepSeek Coder and its successors have been widely used. I have always felt that DeepSeek's reputation abroad was mainly "DeepSeek is strong at coding": the code model is very well done, and relatively few people knew about the rest.

Coding is genuinely helpful for developers. It is different from math: a math model mainly helps a small number of people solve problems, while coding helps a great many programmers and developers write code. Coding was one of the earliest large-model applications that truly improved productivity; many programmers today can hardly write code without large-model assistance. At that time, chat models were mostly used for casual conversation, and the actual gain in work efficiency was not obvious. DeepSeek Coder also put forward a thesis at the time: if we focus on the code model, it will also help the general model in reasoning and other respects.
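To make the "continue pretrain" idea above concrete, here is a minimal sketch (in Python, using Hugging Face transformers) of continued pretraining versus pretraining from scratch. The model identifier and the 70/30 data mixture are illustrative assumptions, not the actual training recipe.

```python
# Minimal sketch: "continue pretraining" vs. pretraining from scratch.
# The model ID and mixture ratios below are for illustration only.
from transformers import AutoConfig, AutoModelForCausalLM

# Ordinary pretraining: random initialization, train from scratch.
config = AutoConfig.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
scratch_model = AutoModelForCausalLM.from_config(config)

# Continue pretraining: start from an existing general-purpose checkpoint
# and keep running the same next-token-prediction objective on new data.
base_model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

data_mixture = {
    "code": 0.70,            # source-code corpora
    "text_and_math": 0.30,   # general text and math, to limit forgetting
}
# ... then feed tokens sampled according to data_mixture into the usual
# causal-LM training loop (omitted); the result is a "v1.5"-style model.
```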
DeepSeek Coder V2 switched to MoE, which is natural because it is based on the DeepSeek V2 MoE model (which we discussed before). In the paper they mention that DeepSeek Coder V2 is further pretrained: it was trained on an additional 6T tokens starting from the general DeepSeek V2 MoE checkpoint to obtain the coder. So DeepSeek Coder V2 can be regarded as using DeepSeek V2 as its base.
What I want to focus on here is the reward model. We will talk about R1 later, including rule-based rewards and other kinds of rewards; for now, note that DeepSeek Coder V2 still uses a reward model for coding. They emphasize its importance in the paper, writing "We still decide to train a reward model", although we later learned that they began to abandon this route. The whole community was doing coding rewards this way at the time, and they also ran a controlled experiment and found that it was indeed more effective.

What are the benefits of a reward model? With a reward model to judge right and wrong, you can select among candidates. At deployment time you do not generate just one response but, say, 64. Assuming the reward model can judge which one is most likely correct, you return that one to the user and discard the rest, so accuracy may be higher. But there is a problem: the cost definitely goes up. Originally you only needed to generate one response; now you generate 64 and let the reward model choose, so generation becomes more expensive. They compared this against other methods, such as majority voting (with no reward model, you generate 64 results yourself and vote: for example, if 32 of the 64 answers say 1 and 20 say 2, you take 1 as the final answer) and an ORM (outcome-supervised reward model), and found that the process-supervised reward model was the best of the three. However, this approach was never verified at much larger scale. As the scale grows, the reward model itself develops all kinds of problems, while rule-based rewards are more robust: with a large amount of data, rules remain accurate and stable, whereas an additionally trained model can go wrong in many ways. In that paper, though, they were in line with the mainstream thinking of the community at the time and still used a reward model. This can be regarded as an early reasoning attempt.
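As an illustration of the trade-off described above, here is a minimal sketch contrasting reward-model reranking (best-of-N) with majority voting. The helpers generate_samples, reward_model_score, and extract_answer are hypothetical stand-ins for the real sampling and scoring stack, not DeepSeek's implementation.

```python
# Sketch: two ways to pick a final answer from N sampled completions.
from collections import Counter

def best_of_n(question, n=64):
    samples = generate_samples(question, n=n)            # n full solutions
    scores = [reward_model_score(question, s) for s in samples]
    return samples[scores.index(max(scores))]            # reward model reranks

def majority_vote(question, n=64):
    samples = generate_samples(question, n=n)
    answers = [extract_answer(s) for s in samples]       # e.g. the boxed final answer
    most_common, _ = Counter(answers).most_common(1)[0]
    # return any sample whose final answer matches the majority answer
    return next(s for s in samples if extract_answer(s) == most_common)
```

The cost of both is N generations per query; the reward-model route additionally pays for scoring and for training the reward model itself.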
Chapter 7 April 2024
"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models"
Next I will talk about the math-related research. This is a very important paper, but it is not mentioned very often. I think that is because it reads like a smaller academic paper, unlike the ones just mentioned, which are big open-source model releases and therefore famous. This paper was the first public work to follow up on OpenAI's "Let's Verify Step by Step", the paper that I think led everyone toward reward models. After OpenAI published it, process-supervised reward models became very popular and many people worked on them. DeepSeekMath was the first public work to reproduce OpenAI's process, and it did so automatically, without human annotation. After "Let's Verify Step by Step", the accepted conclusion was that a process-supervised reward model should be used and that it helps decoding, and everyone believed it.

Here is an explanation of the process-supervised reward model: math problems involve multiple reasoning steps. Traditional outcome supervision only checks whether the final result is correct, while process supervision judges whether each step is correct. OpenAI's approach was to spend a lot of money hiring math experts to annotate 800,000 step-level labels (the PRM800K dataset) and use them to train a model that predicts the correctness of each step. DeepSeek's innovation is that no manual annotation is needed: they automatically construct correctness labels for the steps and use them to train the reward model. The logic is as follows: suppose the model has generated up to the second step. Fix the content of that step and let the model continue to generate multiple subsequent solutions. If those continuations ultimately reach the correct answer, the second step is considered correct; otherwise it is judged wrong. Through this method of "letting the model verify itself via subsequent results", DeepSeek achieved process supervision without manual annotation. Their experiments found that this automatic annotation comes close to human annotation while the cost is greatly reduced. The core difference between this work and OpenAI's is the annotation method, but the final effect is close.
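A minimal sketch of that automatic labeling idea, assuming hypothetical helpers sample_completions and extract_answer; the number of rollouts and the labeling threshold are illustrative choices, not the paper's exact settings.

```python
# Sketch: label step k of a solution by fixing the prefix up to step k,
# rolling out several continuations, and marking the step "correct" if
# enough rollouts reach the right final answer.

def label_step(question, steps, k, gold_answer, n_rollouts=8, threshold=0.0):
    prefix = question + "\n" + "\n".join(steps[: k + 1])    # keep steps 0..k fixed
    completions = sample_completions(prefix, n=n_rollouts)  # model continues the solution
    n_correct = sum(extract_answer(c) == gold_answer for c in completions)
    # hard label: step k is good if enough rollouts recover the gold answer
    return 1 if n_correct / n_rollouts > threshold else 0

# Labels produced this way are then used to train a process-supervised
# reward model that predicts, for each step, whether it is on a correct path.
```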
This paper has a good standing in the academic community because it is a milestone in this direction: at the time, it was the only publicly available process-supervised mathematical reward model. OpenAI had done it too, but did not open-source it, and most people cannot afford to pay annotators, so in practice they use methods similar to the one in this paper.
This article is very important because the famous GRPO method was proposed in it. The model itself is simple: it is just a 7B model. You can see that they continue pretraining on top of DeepSeek Coder v1.5 7B to obtain DeepSeekMath 7B, and for a long time this 7B was the best open-source math model. But the really important thing, and the reason it is worth discussing, is the method they invented: GRPO. The part of this paper most worth reading is the section on reinforcement learning, because to this day, starting from this paper, all of DeepSeek's reinforcement-learning post-training, from DeepSeek V2 to V3 to R1, uses GRPO.

At the time, most people used PPO (Proximal Policy Optimization). There was nothing wrong with PPO itself; it was developed by OpenAI and is a classic reinforcement learning method. Its only problem, in DeepSeek's view, was that it is too expensive and takes too many resources to run. PPO involves four models: the policy model, the reward model, the reference model, and the value model. Each of these can be very large, they all have to sit on your machines, and the whole setup consumes a lot of resources, so PPO is much more expensive than ordinary SFT. They thought this was too costly, which again shows how much DeepSeek wanted to reduce cost and increase efficiency. So they built something new, GRPO, which has no value model. Removing the value model means one fewer model to keep in memory during training, and that model might itself be very large (say, 100B), so the freed memory can be used for other things and the whole training becomes more efficient. That is GRPO. They have used it ever since, including in R1 later, and GRPO has now been implemented in all kinds of open-source frameworks.
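To show what dropping the value model buys, here is a minimal sketch of the group-relative advantage at the heart of GRPO: the baseline is simply the mean reward of a group of responses sampled for the same prompt. This is a simplified illustration, not DeepSeek's training code.

```python
# Sketch: GRPO replaces the learned value baseline with a group statistic.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled response."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)   # group-relative advantage

# Example: 8 responses to one prompt, rewarded 1.0 if the answer is correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
# Responses above the group mean get a positive advantage and are reinforced;
# a PPO-style clipped policy-gradient update then uses these advantages,
# with no separate value network to store or train.
```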
This paper is one of my favorites, because I started paying attention to reinforcement learning very early, and I think its treatment of reinforcement learning for math is very insightful. At that time, this kind of online reinforcement learning was not widely practiced in the community; most people were still doing distillation and SFT. Reinforcement learning is not a form of distillation; it is closer to self-improvement. Even after DeepSeekMath came out, reinforcement learning did not become mainstream, at least not in the open-source community or academia. One reason is that the training cost is still fairly high and requires more GPUs, and everyone's SFT results were also decent, so the approach did not take off for several months. This paper came out in the first half of 2024, and it was only several months later that reinforcement learning started to become mainstream. Of course, its reinforcement learning still relies on a reward model, which shows that DeepSeek had not yet adopted rule-based rewards at this point. So what did they do in DeepSeekMath? Following the approach just described, they used the same method to build the training dataset, trained a process-supervised reward model on it, and then used that reward model for GRPO.
For a long time, everyone's idea was to build a reward model, but DeepSeek ran some very interesting experiments. In the reasoning field, the common practice was rejection sampling, which is also mentioned in the Llama 3 paper: Llama 3's post-training for math, coding, and essentially all reasoning uses it. What does it mean? The model generates a lot of data by itself, then you check whether each answer is right or wrong, throw away the wrong ones, keep the correct ones, and use the correct data to continue training the model with SFT.

But there is an online-versus-offline issue here, which turns out to be very important. For a long time, the entire community did reasoning offline. In online training, the data changes as the model is trained: after the model generates data and gets updated, new data has to be generated from the updated model to continue training, and this happens continuously, at high frequency. Offline means I use the current model to generate one large dataset, train on it once, and that is it; the data does not change as the model updates. Many people in the community do offline training, or so-called iterative training, which sits between online and offline: generate a batch of math data, train once, and then, to refresh the data, generate another batch with the new model and train another generation, repeating in the hope that the model keeps improving. Typically people do three or five generations before it stops helping or converges. That is not fully online training. Truly online training, such as the earliest PPO setups, requires many generations; training like R1 involves more than a hundred policy updates. Nowadays most people do one, three, or five generations; almost no one does ten or twenty, let alone one or two hundred. So very few people do online reinforcement learning.

In the reasoning field, everyone was still doing SFT and RFT, including Llama 3, which mainly does RFT. Although the DeepSeek paper pointed out the value of online training very early, everyone still went back to doing DPO. DPO can of course be done online, but it is more often done offline, with rejection sampling. I think this is partly because online training is not very stable, is difficult to tune, and is costly: whether you use PPO or GRPO, you need more compute, and it is relatively unstable, especially once you add a reward model, which makes hyperparameter tuning even harder. Ordinary SFT or RFT, by contrast, gives decent results without much tuning. These reasons, plus the immaturity of the community's open-source ecosystem, kept this method from becoming mainstream. So although this DeepSeek math paper came out very early and some work made similar attempts, none of it became mainstream for several months.
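To pin down the terms used above, here is a minimal sketch of the rejection-sampling (RFT) recipe and its iterative variant, assuming hypothetical helpers generate, is_correct, and sft_train; fully online RL differs in that the data is regenerated at every policy update.

```python
# Sketch: offline RFT vs. iterative RFT.

def rejection_sampling_round(model, problems, n_per_problem=16):
    kept = []
    for p in problems:
        for sol in generate(model, p, n=n_per_problem):
            if is_correct(p, sol):            # keep only solutions with the right answer
                kept.append((p, sol))
    return kept

def offline_rft(model, problems):
    data = rejection_sampling_round(model, problems)  # one big generation pass
    return sft_train(model, data)                     # one training pass, then stop

def iterative_rft(model, problems, generations=3):
    for _ in range(generations):                      # typically 3 to 5 "generations"
        data = rejection_sampling_round(model, problems)
        model = sft_train(model, data)                # regenerate with the new model
    return model

# Fully online RL (PPO/GRPO) pushes this to the limit: the data is regenerated
# continuously, over hundreds of policy updates.
```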
As mentioned at the beginning, about half a year ago we ourselves tried online PPO combined with a process-supervised reward model for online reinforcement learning. We found it difficult to achieve the desired effect and ultimately failed. I think many other teams tried this too and probably did not succeed either, which is part of why it did not become mainstream. That is the background.
There is also a very interesting section in DeepSeekMath, Section 5.2.2 near the end, where they discuss why reinforcement learning (RL) works and how to achieve more effective reinforcement learning. I think this is directly related to the later development of R1: it shows that the DeepSeek team started thinking about this problem very early, more than half a year before. Why? Because of a chart in the paper, the most cited one, about Pass@K. What does Pass@K mean? I sample K times, say 8 or 16 responses, and ask whether any of them is correct. For example, if Pass@K reaches 90%, it means that for 90% of the queries, at least one of the K sampled responses contains the correct answer. Why is this metric important? Because it measures whether the model is able to sample the correct answer at all. When doing reinforcement learning, the model has to explore on its own, so this ability essentially indicates whether the model can explore its way to the correct answer; if it cannot, the effect of reinforcement learning may be greatly reduced.

This paper actually gave a negative signal, telling everyone that reinforcement learning may not be as effective as we think. At K=1 the blue line is higher than the green line, which looks great. But as K increases, especially on the easier GSM8K set, the green line catches up with and even exceeds the blue line, which suggests that RL did not improve, and may even have hurt, the model's ability to explore correct answers. The pattern is similar for other models, where the two lines eventually overlap. This is contrary to what everyone expected: the hope was that once reinforcement learning could scale and self-iterate, the model would keep improving. The negative signal in this figure shows the effect is not as good as hoped. They themselves said that the improvement mostly comes from ranking correct answers higher, rather than from fundamentally improving the model's capability through reinforcement learning. This is a very interesting observation, because almost nobody else had made it at the time, and DeepSeek wrote it up very honestly. Even though their own results were much better, they poured cold water on themselves and discussed these problems frankly in the paper. Instead of rushing to promote their method, they turned around and examined whether it was really effective, observed the phenomenon rigorously, and presented the problem to everyone, telling everyone that this method may not work as expected.

They then explored a second question: if this method does not achieve the expected effect, how can we achieve truly effective reinforcement learning? They did not run these experiments, but proposed hypotheses, such as better data and better inference. One interesting point is the reward function, because here they still use a reward model, and the first thing they mention is how to improve the generalization ability of the reward model, which they see as crucial for making the model effective. Looking back from today, we can draw a conclusion.
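For reference, here is a small sketch of how the Pass@K metric discussed above is typically computed; the unbiased estimator below is the standard one from the code-generation literature, shown as an illustration rather than DeepSeekMath's exact evaluation script.

```python
# Sketch: Pass@K = fraction of problems where at least one of K samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n samples, of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def dataset_pass_at_k(results, k):
    # results: list of (n_samples, n_correct) pairs, one per problem
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```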
Coming back to the reward question, these are of course two separate directions: you can focus on designing a better reward model, improving its ability and making it more effective; but from today's perspective, the other way is not to use a reward model at all. If you only use rules, the result is the most robust. For example, in mathematics, if the final answer is correct, you consider the solution correct. Such a rule is universally applicable: whether the problem is elementary school, junior high, high school, or college level, there is a standard answer at the end, and if the answer matches, you judge the solution correct, or at least not badly wrong, although that is not always true; occasionally the answer is right but the process is wrong. Still, the rule is simple and direct. It does not care whether you are an elementary school student, a college student, a graduate student, or a PhD; it is simple and effective. A reward model is different: a reward model trained on elementary and junior high school data may accurately judge math problems at that level, but give it a college-level problem and its judgment may no longer be reliable. That is the generalization issue. The reward model is always fragile, while the rule is robust.
Chapter 8
"DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data"
Another interesting line of work is theorem proving: DeepSeek Prover, of which there are two versions, DeepSeek Prover v1 and DeepSeek Prover v1.5. This is very interesting. What does "prover" mean? I think it is related to all the reasoning work they did later, because proving theorems essentially relies on rules. The theorem-proving task is different from general math problems: it has an additional component, a proof-checking engine (in DeepSeek Prover's case, the Lean proof assistant). Suppose you have statements written in a formal mathematical language; you feed them in, much like giving a Python program to an interpreter, and it tells you whether the proof is correct. The feedback it gives is completely standard, like formal verification in mathematics. What DeepSeek Prover does is convert informal inputs, such as natural-language or informal math problems, into formal language for verification; its task is to translate informal language into formal language. The unique thing about this setup is that once a problem is in formal language, an external tool can verify its correctness, and that tool does not rely on a reward model. This connects to the issue just discussed: the proof-checking process acts like a rule-based check. The rules mentioned earlier are defined by humans, such as "is the final answer correct?" or "did the code pass the tests?"; here the rule is not defined by humans but comes from the proof-checking engine itself. They do not use online training here, but iterative training: the iterative self-update I mentioned earlier, where you generate data, use the tool to verify the results, discard the wrong ones, keep the correct ones, generate new data, and repeat. As we saw in DeepSeekMath, online training has since been found to be more effective than iterative training.
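A minimal sketch of that iterative generate-verify-retrain loop, with generate_proofs, lean_verify, and sft_train as hypothetical stand-ins for the model sampler, the external proof checker, and the fine-tuning step.

```python
# Sketch: iterative self-update with an external proof checker.

def prover_iteration(model, statements, n_attempts=32):
    verified = []
    for thm in statements:                         # formal theorem statements
        for proof in generate_proofs(model, thm, n=n_attempts):
            if lean_verify(thm, proof):            # binary, rule-like feedback
                verified.append((thm, proof))
                break                              # one verified proof is enough
    return verified

def expert_iteration(model, statements, rounds=4):
    for _ in range(rounds):
        data = prover_iteration(model, statements)
        model = sft_train(model, data)             # retrain on verified proofs only
    return model
```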
DeepSeek Prover v1.5 applies a newer approach to reinforcement learning. Unlike the previous iterative self-update, they now use GRPO, with some changes centered on the reward. They note that a reward model would normally make reinforcement learning more effective, but in theorem proving the reward signal is sparse, so they did not continue with a reward model and instead chose to drop the parts that are hard to judge. This is something of a detour: although they were gradually giving up on reward models, the influence of reward models was still present at this stage. As soon as this model came out, it also adopted some techniques that were very popular in the reasoning field at the time, such as MCTS (Monte Carlo Tree Search), a way of expanding a decision tree over multiple steps; in fact, they developed a variant of MCTS in this work. This is also interesting, because after OpenAI's o1 was released, everyone debated whether MCTS was an important ingredient of OpenAI's approach. Since MCTS has to build a decision tree during generation, the process is relatively complicated; the R1 model we see today is much more simplified and does not need anything that complex. Although these two works are relatively niche and specialize in theorem proving, they are very distinctive because they focus on scenarios where rules are verifiable.
Chapter 9
"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"
As you can see, DeepSeek R1 is simplicity taken to the extreme. Whatever detours were taken and whatever explorations were done before, everything converged on this pipeline. So what is its reward? There are only two rewards:
1. Compare the output to see if it is correct
2. Check whether the output of the model conforms to the expected format
The format reward exists because reinforcement learning by itself does not constrain how the model writes its output; without it, the model may drift away from the intended format. At a minimum, we want the model to generate in a predetermined template: first think, wrapped in think tokens, then output the answer. Both checks are rule-based and do not rely on reward models. You can see the progression: the earlier DeepSeek Coder and DeepSeekMath still used reward models, DeepSeek Prover began to try doing without one, and by DeepSeek R1 the reward model was abandoned entirely. The rules are extremely simple: just check whether the final result is correct and whether the output matches the expected format.
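A minimal sketch of what such rule-based rewards can look like; the think/answer tags, the string-equality answer check, and the reward values are assumptions for illustration, not R1's exact implementation.

```python
# Sketch: two rule-based rewards, accuracy and format, with no reward model.
import re

THINK_ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(output: str) -> float:
    # Reward the expected template: reasoning inside think tags, then an answer.
    return 1.0 if THINK_ANSWER_PATTERN.match(output.strip()) else 0.0

def accuracy_reward(output: str, gold_answer: str) -> float:
    m = THINK_ANSWER_PATTERN.match(output.strip())
    if not m:
        return 0.0
    predicted = m.group(1).strip()
    return 1.0 if predicted == gold_answer else 0.0   # rule-based answer check

def total_reward(output: str, gold_answer: str) -> float:
    return accuracy_reward(output, gold_answer) + format_reward(output)
```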
The "zero" model of DeepSeek R1 is worth everyone's attention. It is called "zero" because there is no previous SFT (supervised fine-tuning), and its base model is directly used for reinforcement learning (RL) . This is different from the traditional approach. Traditional methods such as Llama and other works all perform SFT first and then start RL. This is an innovation, or in other words, although this approach seems simple, not many people may have tried it before. Many people naturally think that it is not necessary to do this, or think that it would be better to do SFT, so that secondary training will be simpler. Therefore, although everyone may not have tried it, the DeepSeek team started from here, and at this time there was the concept of "long COT" (long thinking chain), and with the results of OpenAI as the background, everyone began to monitor the length of the generated thinking chain. Although DeepSeek R1 only used the "zero" model (without SFT), its effect continued to improve when solving difficult mathematical problems (such as AME). Without using distillation technology or external reward models, it relies entirely on its own reinforcement learning on the data set, and the effect is still significantly improved, from 0.2, 0.3 all the way up to 0.7, 0.8. This process is also impressive, because OpenAI technicians later admitted that DeepSeek may use similar technology to them. Although many people have tried to reproduce OpenAI's achievements through complex methods such as distillation and MCTS, no one has adopted this simplified approach. DeepSeek achieves its effects through simple rules and the generation of long chains of thoughts, which is in sharp contrast to previous complex decoding methods. This makes me feel that many people do not fully understand R1. In fact, its ideas are simpler and more efficient than ever before. Instead of relying on complex tools or models, they generate them through simple rules and use reinforcement learning to optimize the model itself.
I think DeepSeek R1 represents an important turning point in AI model design, which simplifies complex processes to the extreme, emphasizes the powerful ability of rule-based, and breaks away from the traditional approach of relying on reward models. Even though it is a "zero" model, it can achieve excellent results. Its innovative ideas and concise implementation have shocked the entire AI field.
To recap: R1 was built on top of V3, which they had already developed. V3 itself was not simple: it carries a lot of techniques, including MLA (introduced in V2 and carried into V3) and the MoE architecture with shared experts, all described in detail in the earlier papers. So R1 was built on top of V3. What did they do next? GRPO, which was proposed in the DeepSeekMath paper and developed by themselves, so it needs no further explanation; that paper had been published roughly half a year earlier. They also abandoned the complex techniques they had explored before, such as MCTS and reward models, and switched to a simpler rule-based reward, which ended up working very well. They successfully scaled this method, the results were outstanding, and it quickly attracted a lot of attention. But I think this paper is the simplest of all the papers I am discussing today, because it does not have many technical details to dig into, yet the results are very good and impressive.
The earlier work was critical, especially the progress on V3 and the use of techniques such as GRPO; there were some detours, but the final effect is very clear. The results of this paper genuinely surprised me, because I did not expect the "Zero" training mode to work this well. At the same time, the work discussed today shows that R1 did not appear out of thin air: it integrates their previous experience, successes and failures alike, and finally produced these excellent results.