How to train LLMs to “think” like DeepSeek-R1

Explore the secrets of LLMs' "thinking" ability, and fully analyze the DeepSeek-R1 training method.
Core content:
1. The three key stages of LLM training: pre-training, supervised fine-tuning, and reinforcement learning
2. The core of DeepSeek-R1 training: stimulating model reasoning ability through reinforcement learning
3. DeepSeek-R1-Zero: An initial exploration of reinforcement learning, laying the foundation for DeepSeek-R1
DeepSeek-R1 (see Paper Review: DeepSeek-R1 - Improved Reasoning Capabilities of Large Language Models Driven by Reinforcement Learning) is a recently released LLM that has demonstrated strong performance in fields such as mathematics, programming, and reasoning; its "thinking" ability in particular has attracted widespread attention in the industry. This article explores in depth how to train LLMs so that they can "think" like DeepSeek-R1, providing guidance for AI researchers from basic principles to specific training methods.
1. Basic principles of LLM training
LLM training usually includes three key stages: pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL).
- Pre-training
At this stage, the model learns a large amount of general knowledge that lays the foundation for its basic capabilities. By training on a large-scale corpus, the LLM captures the statistical regularities of language, building a solid base for subsequent tasks.
- Supervised Fine-tuning (SFT)
Building on pre-training, the model's ability to understand and follow instructions is strengthened with instruction-response datasets. The SFT stage (see Deep Understanding of Fine-Tuning: Unlocking the Potential of Large Language Models) introduces task-specific data so that the model can better adapt to the needs of particular domains.
- Reinforcement Learning (RL)
Human or AI feedback is used to optimize model behavior and ensure that generated content is aligned with user expectations. In the RL stage, through trial-and-error learning (see DeepSeek R1: The Synergistic Power of Reinforcement Learning and Knowledge Distillation), the model continuously refines its outputs, improving task completion and user satisfaction. A schematic sketch of this three-stage pipeline follows.
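The sketch below outlines the control flow of these three stages. Every name here (pretrain, supervised_finetune, reinforce, reward_fn) is a hypothetical placeholder chosen for illustration; it does not correspond to any real training framework or to DeepSeek's code.

```python
# A schematic outline of the three-stage pipeline described above.
# All function names are hypothetical placeholders, not real training code.
from typing import Callable, Iterable


def pretrain(model: dict, corpus: Iterable[str]) -> dict:
    # Stage 1: next-token prediction over a large unlabeled corpus.
    for text in corpus:
        pass  # compute the language-modeling loss on `text` and update weights
    return model


def supervised_finetune(model: dict, pairs: Iterable[tuple[str, str]]) -> dict:
    # Stage 2: imitate (instruction, response) pairs to learn instruction following.
    for instruction, response in pairs:
        pass  # compute the loss on `response` given `instruction` and update weights
    return model


def reinforce(model: dict, prompts: Iterable[str],
              reward_fn: Callable[[str], float]) -> dict:
    # Stage 3: sample completions, score them with a reward, and reinforce
    # high-reward behavior (e.g. with PPO or GRPO).
    for prompt in prompts:
        completion = ""  # sample a completion from the model for `prompt`
        reward = reward_fn(completion)  # use the reward to update the policy
    return model


# The three stages run in sequence:
# model = reinforce(supervised_finetune(pretrain(model, corpus), sft_pairs),
#                   prompts, reward_fn)
```

In practice each stage is gradient-based optimization over enormous amounts of data; the skeleton only shows how the stages feed into one another.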
The success of DeepSeek-R1 is largely due to its innovations in the RL stage. Below, we analyze the DeepSeek-R1 training method in detail, especially how its "thinking" ability is built.
2. Training method of DeepSeek-R1
The training of DeepSeek-R1 is a complex and carefully engineered process that integrates multiple models and techniques. At its core, it uses reinforcement learning to elicit the model's reasoning ability and realize the "thinking" function.
1. DeepSeek-R1-Zero: A Preliminary Study of Reinforcement Learning
DeepSeek-R1-Zero is the predecessor of DeepSeek-R1. It was developed from DeepSeek-V3 (671B parameters) and takes a distinctive approach: it directly applies rule-driven RL techniques, such as Group Relative Policy Optimization (GRPO), to evaluate and reward the quality of model outputs.
- Skip the traditional SFT stage
Instead of going through the traditional supervised fine-tuning stage, DeepSeek-R1-Zero is optimized directly with reinforcement learning. This approach reduces the reliance on human-labeled data and lowers training costs.
- Reflection on its own approach
During training, DeepSeek-R1-Zero is able to reflect on its own reasoning and refine it step by step. This self-iterative ability enables the model to continuously discover and improve its reasoning strategies.
Although DeepSeek-R1-Zero has some problems with readability and language mixing, it laid a solid foundation for the success of DeepSeek-R1. Through RL training alone, DeepSeek-R1-Zero discovered the value of the "thinking" token and demonstrated remarkable reasoning ability.
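As a concrete illustration of the rule-driven RL mentioned above, the sketch below shows the group-relative advantage at the heart of GRPO: several completions are sampled for the same prompt, scored with a simple rule, and each completion's reward is normalized against its own group. This is a minimal sketch of the general idea, not DeepSeek's implementation; the reward values are made up.

```python
# Minimal sketch of GRPO's group-relative advantage, assuming rule-based
# rewards (e.g. 1.0 if the final answer matches the reference, else 0.0).
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward against its own group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]


# Example: six rollouts for one math prompt, scored by an exact-match rule.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(group_relative_advantages(rewards))
# Correct rollouts receive positive advantages and incorrect ones negative,
# so the policy update makes the correct behavior more likely.
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason this style of RL is attractive for very large models.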
2. DeepSeek-R1: Reinforcement training combining SFT and RL
To address the readability and language-mixing issues of DeepSeek-R1-Zero, the DeepSeek team adopted a multi-stage training strategy combining supervised fine-tuning (SFT) and reinforcement learning (RL).
- SFT on reasoning data
First, a large number of long chain-of-thought (CoT) examples are introduced through SFT to help the model understand the expected response format and unlock better reasoning performance. The key at this stage is to show the model clear reasoning examples that guide it toward correct reasoning paths.
- R1-Zero-style RL
Next, the same RL training steps as in R1-Zero are applied, with an added language-consistency reward to address the language-mixing problem (a sketch of such a reward appears after this list). This step reinforces the model's sense of language regularity and improves the readability of its output.
- SFT on mixed data
Then, SFT is performed on mixed data consisting of reasoning data and non-reasoning data; the latter comes from the SFT dataset of DeepSeek-V3 (see DeepSeek-V3 Deep Analysis: A Comprehensive Interpretation of the Next Generation of AI Models) and synthetic data generated by DeepSeek-V3. This stage teaches the model to distinguish reasoning tasks from non-reasoning tasks and improves its general usefulness.
- RL + RLHF
Finally, another round of RL training is conducted, including R1-Zero-style reasoning training and RL based on human feedback. This stage further sharpens the model's reasoning ability and improves its helpfulness and harmlessness.
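The language-consistency reward mentioned in the second step can be approximated with a simple script-based heuristic: reward the fraction of the chain of thought written in the target language. The sketch below is an illustrative assumption, not DeepSeek's exact formulation.

```python
# Rough sketch of a language-consistency reward: the fraction of word-like
# tokens in the chain of thought that are written in the target language.
import re


def language_consistency_reward(cot_text: str, target: str = "en") -> float:
    """Return the proportion of tokens written in the target script."""
    # Latin word runs and individual CJK characters count as tokens.
    tokens = re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]", cot_text)
    if not tokens:
        return 0.0
    if target == "en":
        in_target = [t for t in tokens if t[0].isascii()]
    else:  # treat CJK (non-ASCII here) as the target language
        in_target = [t for t in tokens if not t[0].isascii()]
    return len(in_target) / len(tokens)


# A mostly-English chain of thought with one Chinese word scores below 1.0.
print(language_consistency_reward("First, 设 x equal to 2, then x squared is 4."))
```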
Through the above training process, DeepSeek-R1 not only inherits the reasoning ability of DeepSeek-R1-Zero but also resolves its readability and language-mixing problems, showing strong performance across many tasks, especially in mathematics, programming, and reasoning.
3. How to train LLMs to achieve “thinking” ability
Based on the successful experience of DeepSeek-R1, we can summarize some key steps and methods for training LLMs to achieve "thinking" ability.
1. Choose the right base model
First, choose a large language model with strong foundational capabilities as the starting point. The model should be fully pre-trained, with rich linguistic knowledge and understanding. Both DeepSeek-R1 and DeepSeek-R1-Zero were developed from DeepSeek-V3, which underscores the importance of a strong base model.
2. Design a reasonable reward mechanism
In the reinforcement learning stage, the design of the reward mechanism is crucial. The reward should accurately reflect the quality of the model's output and motivate the model to keep improving its reasoning strategy. DeepSeek-R1 adopts a multi-part reward covering accuracy, format, and language consistency, which ensures both effectiveness on reasoning tasks and readability of the output.
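As a hedged sketch of such a multi-part reward, the example below combines an accuracy term, a format term, and a language-consistency term into one scalar. The weights, tag checks, and helper functions are illustrative assumptions, not DeepSeek-R1's published reward.

```python
# Illustrative composite reward: accuracy + format + language consistency.
# Weights and checks are arbitrary choices for demonstration purposes.
import re


def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0


def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the text outside the reasoning block contains the reference answer."""
    answer_part = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    return 1.0 if reference_answer in answer_part else 0.0


def total_reward(completion: str, reference_answer: str,
                 language_consistency: float) -> float:
    # Weighted sum; the weights are illustrative, not tuned values.
    return (1.0 * accuracy_reward(completion, reference_answer)
            + 0.5 * format_reward(completion)
            + 0.2 * language_consistency)


sample = "<think>2 + 2 = 4, so the answer is 4.</think> The answer is 4."
print(total_reward(sample, "4", language_consistency=1.0))  # 1.0 + 0.5 + 0.2 = 1.7
```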
3. Introducing the “Thinking” Token
The "thinking" token is one of the key innovations that enables DeepSeek-R1 to achieve reasoning capabilities. By introducing special tokens to mark the model's reasoning process during training, we can make the model understand the task requirements more explicitly and guide it to gradually unfold reasoning. The success of this approach lies in that it provides a structured way to present the model's reasoning process, thereby improving the readability and accuracy of the output.
4. Enhance training with multimodal data
Although DeepSeek-R1 mainly targets language and mathematical reasoning tasks, introducing multimodal data can further improve a model's generalization ability. By integrating and cross-validating multiple channels such as vision, language, and symbols, the model can better understand the logical relationships in complex scenarios and produce more accurate and reliable reasoning.
5. Continuous optimization and iteration
Finally, continuous optimization and iteration are key to training LLMs with "thinking" capabilities. By continuously collecting and analyzing the model's outputs, we can identify problems and shortcomings and adjust training strategies and methods accordingly. As the technology advances and new algorithms emerge, training frameworks and tools should also be updated in a timely manner to improve training efficiency and model performance.
The success of DeepSeek-R1 demonstrates the great potential of reinforcement learning for training large language models to reason. Through well-designed reward mechanisms, the introduction of "thinking" tokens, the use of multimodal data, and continuous optimization and iteration, we can train LLMs with strong reasoning capabilities. Such models have transformative potential in fields such as scientific discovery, judicial decision-making, and strategic planning.
However, current LLMs still face challenges in reasoning. For example: how can the accuracy and readability of the model be further improved? How can logical relationships in complex scenarios be handled better? How can the model's "catastrophic forgetting" problem be solved? Answering these questions will require continued exploration and innovation.