DeepSeek-R1 Technical Details

Explore how DeepSeek-AI improves the reasoning ability of language models through pure reinforcement learning.
Core content:
1. A comparison of reasoning ability between the OpenAI o1 series and DeepSeek-AI's models
2. Training methods and performance analysis of DeepSeek-R1-Zero and DeepSeek-R1
3. A detailed explanation of GRPO, the reinforcement learning framework used by DeepSeek-R1-Zero
OpenAI's o1 series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought (CoT) during reasoning, achieving significant improvements on various reasoning tasks such as mathematics, programming, and scientific reasoning. Previous studies have explored different approaches, including process-based reward models, reinforcement learning, and search algorithms such as Monte Carlo Tree Search and beam search. However, none of these approaches has achieved general reasoning performance comparable to OpenAI's o1 models.
DeepSeek-AI uses pure reinforcement learning (RL) to improve the reasoning ability of language models, exploring the potential of large language models (LLMs) to develop reasoning ability without any supervised data and focusing on their self-evolution through a pure RL process. Based on this method, DeepSeek-AI released its first generation of reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. Among them, DeepSeek-R1 performs comparably to OpenAI-o1-1217 on reasoning tasks, as shown in the figure below.
DeepSeek-R1-Zero is trained with reinforcement learning (RL) alone, without supervised fine-tuning (SFT), yet shows remarkable reasoning ability. However, it faces challenges such as poor readability and language mixing. DeepSeek-R1 adds multi-stage training and cold-start data before RL to address these issues and further improve reasoning performance, achieving performance comparable to OpenAI-o1-1217 on reasoning tasks.
How to train DeepSeek-R1-Zero
DeepSeek-R1-Zero uses DeepSeek-V3-Base as the base model and adopts Group Relative Policy Optimization (GRPO) as the RL framework to improve the model's reasoning performance.
GRPO forgoes the critic (value) model, which is typically the same size as the policy model, and instead estimates the baseline from group scores. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)\right] \frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\varepsilon,\ 1+\varepsilon\right)A_i\right) - \beta\,\mathbb{D}_{KL}\!\left(\pi_\theta\,\|\,\pi_{ref}\right)\right)$$

$$\mathbb{D}_{KL}\!\left(\pi_\theta\,\|\,\pi_{ref}\right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1$$

where $\varepsilon$ and $\beta$ are hyperparameters, and $A_i$ is the advantage, computed from the group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}$$
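To make the objective concrete, here is a minimal NumPy sketch of the group-relative advantage and the clipped surrogate for a single group. The reward values, probability ratios, KL estimates, and the hyperparameters eps and beta below are illustrative assumptions, not values reported for DeepSeek-R1-Zero.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO baseline: normalize each reward by its group's mean and std."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_objective(ratios, rewards, kl_to_ref, eps=0.2, beta=0.04):
    """Clipped GRPO objective for one group of G sampled outputs.

    ratios:    pi_theta(o_i|q) / pi_theta_old(o_i|q) for each output
    rewards:   scalar reward r_i for each output
    kl_to_ref: per-output KL estimate against the reference policy
    """
    ratios = np.asarray(ratios, dtype=np.float64)
    adv = group_relative_advantages(rewards)
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * adv
    # Take the pessimistic (min) term and penalize drift from the reference policy.
    return float(np.mean(np.minimum(unclipped, clipped) - beta * np.asarray(kl_to_ref)))

# Example: a group of 4 outputs where two earned the accuracy reward.
print(grpo_objective(ratios=[1.05, 0.97, 1.10, 0.90],
                     rewards=[1.0, 0.0, 1.0, 0.0],
                     kl_to_ref=[0.01, 0.02, 0.015, 0.01]))
```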
DeepSeek-R1-Zero uses a rule-based reward system, which mainly consists of two types of rewards:
• Accuracy Reward: The accuracy reward evaluates whether the response is correct.
• Format Reward: The format reward forces the model to place its thought process between the ‘<think>’ and ‘</think>’ tags.
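Since both rewards are rule-based, they can be expressed as plain functions of the response string. The sketch below is one plausible implementation for illustration; the exact tag pattern and the exact-match accuracy check are assumptions, not DeepSeek-AI's actual reward code.

```python
import re

# Reasoning must be wrapped in <think>...</think>, followed by a final answer.
THINK_PATTERN = re.compile(r"^<think>.+</think>\s*\S", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the thought process sits between the <think> and </think> tags."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the final answer (the text after </think>) matches the reference."""
    answer = response.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth else 0.0

def rule_based_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)

print(rule_based_reward("<think>2 + 2 equals 4</think> 4", ground_truth="4"))  # 2.0
```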
The figure below shows the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark. As reinforcement learning (RL) training progresses, DeepSeek-R1-Zero's performance improves steadily. Notably, its average pass@1 score on AIME 2024 jumps from an initial 15.6% to 71.0%, reaching a level comparable to OpenAI-o1-0912. This significant improvement highlights the effectiveness of the RL algorithm in optimizing model performance.
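As a side note on the metric, average pass@1 is commonly estimated by sampling several responses per problem and averaging their correctness rates; the short sketch below shows that estimator, assuming the correctness flags have already been computed elsewhere.

```python
def average_pass_at_1(correct_flags_per_problem):
    """For each problem, take the fraction of its sampled responses that are
    correct, then average that fraction over all problems."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Example: 3 problems, 4 sampled responses each (True = correct answer).
samples = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
print(f"average pass@1 = {average_pass_at_1(samples):.3f}")  # 0.583
```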
How, then, does RL drive this improvement? The self-evolution process of DeepSeek-R1-Zero shows how RL can push a model to improve its reasoning capabilities autonomously. By starting RL directly from the base model, the model's progress can be monitored closely, without interference from a supervised fine-tuning phase. This approach provides a clear view of how the model evolves over time, especially in its ability to handle complex reasoning tasks.
As shown in the figure above, DeepSeek-R1-Zero's thinking time increases steadily throughout training. This improvement is not the result of external adjustments but an intrinsic development within the model. By extending its test-time computation from generating hundreds to thousands of reasoning tokens, DeepSeek-R1-Zero gradually improves its ability to solve increasingly complex reasoning tasks, deeply exploring and refining its thinking process.
One of the most notable aspects of this self-evolution is that the model exhibits sophisticated behaviors as test-time computation increases. For example, the model reflects, revisiting and re-evaluating its previous steps, and explores alternative ways to solve problems. These behaviors are not explicitly programmed; they emerge from the model's interaction with the reinforcement learning environment. This spontaneous development significantly improves DeepSeek-R1-Zero's reasoning capabilities, enabling it to solve more challenging tasks with greater efficiency and accuracy.
DeepSeek-R1-Zero's "aha moment" : While training DeepSeek-R1-Zero, a particularly interesting phenomenon was observed, namely an "aha moment" occurred, as shown in the figure below. At this stage, DeepSeek-R1-Zero learned to allocate more thinking time to a problem by re-evaluating its initial approach. This behavior not only demonstrates the growth of the model's reasoning ability, but also vividly demonstrates how reinforcement learning can lead to unexpected and complex results.
How to train DeepSeek-R1
DeepSeek-R1-Zero faces challenges such as poor readability and language mixing. DeepSeek-R1 combines a small amount of cold-start data with a multi-stage training process to address these issues and further enhance reasoning performance. The DeepSeek-R1 training strategy consists of four stages: cold start, reinforcement learning for reasoning, rejection sampling and supervised fine-tuning, and reinforcement learning for all scenarios.
Cold Start
Unlike DeepSeek-R1-Zero, DeepSeek-R1 constructs and collects a small amount of long Chain-of-Thought data to fine-tune the base model, which then serves as the initial RL actor; this avoids the unstable cold-start phase that arises when RL training begins directly from the base model.
DeepSeek-R1 collects thousands of cold-start samples to fine-tune DeepSeek-V3-Base as the starting point for RL. Compared with DeepSeek-R1-Zero, the advantages of cold-start data include:
• Readability: DeepSeek-R1-Zero's outputs are often hard to read, for example mixing multiple languages or lacking the Markdown formatting that makes answers stand out. In contrast, when creating cold-start data for DeepSeek-R1, a readable output pattern was designed that includes a summary at the end of each reply and filters out reader-unfriendly replies.
• Performance potential: By carefully designing cold-start data patterns in combination with human priors, better performance than DeepSeek-R1-Zero was observed. Iterative training is therefore considered a better approach for reasoning models.
Reinforcement Learning for Reasoning
After fine-tuning DeepSeek-V3-Base on the cold-start data, the same RL training process as DeepSeek-R1-Zero is applied to enhance the model's capabilities on reasoning tasks such as coding, mathematics, science, and logical reasoning, i.e. well-defined problems with clear answers. During training, it was observed that language mixing often occurs in the chain of thought (CoT), especially when the RL prompts involve multiple languages. To alleviate this problem, a language consistency reward was introduced into RL training, computed as the proportion of target-language words in the CoT. Although ablation experiments show that this alignment causes a slight drop in model performance, the reward matches human preferences and makes the output more readable. Finally, the accuracy reward for the reasoning task is added to the language consistency reward to form the final reward, and reinforcement learning is run on the fine-tuned model until it converges on reasoning tasks.
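The language consistency reward is described only as a proportion of target-language words in the CoT; the sketch below is one plausible realization, where the ASCII-alphabet heuristic (standing in for an English target language) and the additive combination with the accuracy reward are illustrative assumptions.

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Fraction of word tokens in the CoT that belong to the target language.
    Heuristic for an English target: count pure-ASCII alphabetic tokens."""
    tokens = re.findall(r"\w+", cot)
    if not tokens:
        return 0.0
    in_target = sum(1 for t in tokens if t.isascii() and t.isalpha())
    return in_target / len(tokens)

def final_reward(task_accuracy: float, cot: str, weight: float = 1.0) -> float:
    """Final reward = reasoning-task accuracy + weighted language-consistency term."""
    return task_accuracy + weight * language_consistency_reward(cot)

print(final_reward(1.0, "First simplify the equation, 然后 solve for x"))  # 1.875
```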
Rejection Sampling and Supervised Fine-tuning
When the RL for reasoning converges, the checkpoints are used to generate SFT data for the next round of training. Unlike the cold start data that focuses on reasoning, this stage adds data from other fields to enhance the model's capabilities in writing, role-playing, and other general tasks. Specifically, the data generation and model fine-tuning process is as follows:
• Reasoning data: Reasoning prompts are carefully selected, and reasoning trajectories are generated by performing rejection sampling on the RL training checkpoint from the previous stage. In that stage, only data that could be evaluated with rule-based rewards was included; in this stage, the dataset is expanded with additional data, part of which is judged by a generative reward model, i.e. the ground-truth answer and the model prediction are fed into DeepSeek-V3 for judgment. In addition, because model outputs are sometimes chaotic and hard to read, chains of thought with mixed languages, long paragraphs, and code blocks are filtered out. For each prompt, multiple responses are sampled and only the correct ones are retained (see the sketch after this list). In total, about 600,000 reasoning-related training samples were collected.
• Non-reasoning data: For non-reasoning data such as writing, factual question answering, self-cognition, and translation, the DeepSeek-V3 pipeline is adopted and parts of the DeepSeek-V3 SFT dataset are reused. For some non-reasoning tasks, a potential chain of thought is generated via prompting before the question is answered, whereas for simpler queries (such as "hello") no CoT is provided. In total, about 200,000 training samples unrelated to reasoning were collected.
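Schematically, the reasoning-data step is a filter-and-keep loop over sampled generations. In the sketch below, sample_responses and is_correct are hypothetical hooks standing in for the RL checkpoint's generator and the rule-based or DeepSeek-V3-as-judge check, and the readability filter is only an illustrative heuristic.

```python
import re

CODE_FENCE = re.compile(r"`{3}")  # Markdown code-fence marker

def is_readable(response: str, max_chars: int = 8000) -> bool:
    """Heuristic readability filter: drop very long outputs, code blocks, and
    (roughly) mixed-language chains of thought."""
    if len(response) > max_chars or CODE_FENCE.search(response):
        return False
    has_latin = bool(re.search(r"[A-Za-z]", response))
    has_cjk = bool(re.search(r"[\u4e00-\u9fff]", response))
    return not (has_latin and has_cjk)

def build_reasoning_sft_data(prompts, sample_responses, is_correct, n_samples=16):
    """Rejection sampling: keep only correct, readable responses for each prompt."""
    dataset = []
    for prompt in prompts:
        for response in sample_responses(prompt, n_samples):
            if is_correct(prompt, response) and is_readable(response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```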
DeepSeek-V3-Base was then fine-tuned for two epochs on the approximately 800,000 samples described above.
Reinforcement Learning for All Scenarios
To further align the model with human preferences, a second reinforcement learning stage was implemented to improve the model's helpfulness and harmlessness while further refining its reasoning ability. Specifically, the model was trained with a combination of reward signals and a diverse prompt distribution. For reasoning data, the DeepSeek-R1-Zero methodology was followed, using rule-based rewards to guide learning in mathematics, code, and logical reasoning. For general data, a reward model was used to capture human preferences in complex, nuanced scenarios. Building on the DeepSeek-V3 pipeline, a similar distribution of preference pairs and training prompts was adopted. For helpfulness, the evaluation focuses on the final summary of the generated content, measuring the practicality and relevance of the answer while minimizing interference with the underlying reasoning process. For harmlessness, the entire response (both the reasoning process and the summary) is reviewed to identify and mitigate potential risks, biases, or harmful content. Ultimately, combining these reward signals with a diverse data distribution produced a model that excels at reasoning while prioritizing helpfulness and harmlessness.
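One way to picture this stage is as a router that sends each training prompt either to the rule-based reward or to the learned preference reward model. The sketch below is only an illustration of that split; rule_based_reward and preference_model are hypothetical callables, and the additive combination for general data is an assumption.

```python
REASONING_DOMAINS = {"math", "code", "logic"}

def combined_reward(sample, rule_based_reward, preference_model):
    """Dispatch the reward signal by data type for the second RL stage.

    sample: dict with "domain", "prompt", "response" (the full output), and
            "summary" (the final summary of the response).
    """
    if sample["domain"] in REASONING_DOMAINS:
        # Reasoning data: verifiable rule-based reward, as in DeepSeek-R1-Zero.
        return rule_based_reward(sample["prompt"], sample["response"])
    # General data: preference reward model scores helpfulness on the final
    # summary and harmlessness on the entire response (reasoning + summary).
    helpfulness = preference_model(sample["prompt"], sample["summary"])
    harmlessness = preference_model(sample["prompt"], sample["response"])
    return helpfulness + harmlessness
```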
Distillation: empowering small models to reason
To give more efficient small models reasoning capabilities similar to DeepSeek-R1's, open-source models such as Qwen and Llama were fine-tuned directly on the 800,000 samples curated with DeepSeek-R1 as described above. The results show that this straightforward distillation method significantly enhances the reasoning capabilities of small models.
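Since distillation here is plain SFT on teacher-generated traces, the data-preparation step can be sketched as simple record formatting. The field names and the <think>/</think> layout below are assumptions for illustration, mirroring the format reward used during RL rather than the actual released data format.

```python
import json

def to_sft_record(sample: dict) -> dict:
    """Turn one DeepSeek-R1-generated sample (assumed fields: "prompt",
    "reasoning", "answer") into a prompt/completion pair for SFT."""
    completion = f"<think>{sample['reasoning']}</think>\n{sample['answer']}"
    return {"prompt": sample["prompt"], "completion": completion}

def write_sft_file(samples, path="distill_sft.jsonl"):
    """Write the pairs as JSONL for a standard SFT trainer to consume."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(to_sft_record(sample), ensure_ascii=False) + "\n")
```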
For the distilled models, only supervised fine-tuning (SFT) was performed; the reinforcement learning (RL) stage was omitted, even though adding RL could substantially improve performance, because the main goal of this study is to demonstrate the effectiveness of the distillation technique.
With Qwen2.5-32B as the base model, distilling directly from DeepSeek-R1 outperforms applying RL to that base model, suggesting that the reasoning patterns discovered by larger base models are critical for improving reasoning capability.