Understanding Pre-training and Post-training, with DeepSeek-V3 as an Example

Deepen your understanding of pre-training and post-training, using DeepSeek-V3 as a concrete example to take you from zero to a solid grasp of both stages.
Core content:
1. Pre-training: The basics of building a general language model
2. Detailed process and optimization of DeepSeek-V3 pre-training
3. Post-training: The key steps to improve model practicality and alignment
What is Pre-training?
What is Post-training?
1. Pre-training: Using massive data to lay the foundation for general knowledge
Pre-training uses a large-scale unlabeled corpus to train the language model to model the probability distribution of natural language, without any instructions or task labels, so that it acquires general language understanding and generation capabilities.
1) The problems it solves:
- How the model predicts the next word/sentence
- How to establish semantic and grammatical connections between words and sentences
2) The output:
- A general language model (base model) that has internalized language rules, world knowledge, and some reasoning ability
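To make the "predict the next word" objective concrete, here is a minimal sketch of the standard causal language-modeling loss. It is a generic illustration, not DeepSeek's actual training code; the random tensors stand in for a real model's output.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position t and the actual token at t+1."""
    # logits: [batch, seq_len, vocab_size]; input_ids: [batch, seq_len]
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # ground-truth tokens at positions 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy usage: random tensors stand in for a real model's forward pass on a token batch.
batch, seq_len, vocab = 2, 16, 1000
loss = next_token_loss(
    torch.randn(batch, seq_len, vocab),
    torch.randint(0, vocab, (batch, seq_len)),
)
print(loss.item())
```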
What does DeepSeek-V3 pre-training do?
1. How much data is used?
DeepSeek V3 is pre-trained on 14.8 trillion high-quality, diverse tokens. Compared with DeepSeek V2, V3 optimizes the pre-training corpus, increases the proportion of mathematics and programming samples, and expands multi-language coverage beyond English and Chinese.
2. How long a context does it support?
During pre-training, DeepSeek-V3 extends its context-handling capability in stages:
from 4K → 32K → eventually 128K tokens.
3. What is the training task?
It does more than just "predict the next word": DeepSeek-V3 uses Multi-Token Prediction (MTP):
- The model predicts several future tokens at once (e.g., the next 2 or 3 tokens)
- The causal chain is preserved, which improves the model's ability to learn and express planning
(We have previously discussed MTP in detail, so we will not repeat it here.)
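As a rough illustration of the idea (not DeepSeek-V3's actual MTP module, which chains additional transformer blocks per prediction depth), the sketch below supervises each position on the tokens one and two steps ahead:

```python
import torch
import torch.nn.functional as F

def mtp_loss(depth_logits: list[torch.Tensor], input_ids: torch.Tensor) -> torch.Tensor:
    """depth_logits[d-1] holds predictions of the token d steps ahead, shape [B, T, V]."""
    total = 0.0
    for d, logits in enumerate(depth_logits, start=1):
        preds = logits[:, :-d, :]   # positions that still have a token d steps ahead
        labels = input_ids[:, d:]   # the tokens d steps ahead
        total = total + F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), labels.reshape(-1)
        )
    return total / len(depth_logits)

# Toy usage: two prediction depths (the next token and the one after it).
B, T, V = 2, 16, 1000
ids = torch.randint(0, V, (B, T))
loss = mtp_loss([torch.randn(B, T, V), torch.randn(B, T, V)], ids)
print(loss.item())
```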
4. Additional performance optimization work
DeepSeek-V3 also makes system-level optimizations, such as:
- Architecture optimization: Mixture-of-Experts (MoE) combined with MLA (Multi-Head Latent Attention)
- FP8 mixed-precision training: the first validation that FP8 training is effective at such an extremely large model scale
- DualPipe algorithm: an efficient pipeline-parallel scheduling algorithm designed to reduce pipeline bubbles
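For readers unfamiliar with MoE, here is a toy top-k router showing the basic "route each token to a few experts" idea. It is a simplification for illustration only; DeepSeek-V3's real MoE uses fine-grained and shared experts with auxiliary-loss-free load balancing, and MLA, FP8, and DualPipe are not shown here.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [tokens, dim]
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: 5 token vectors of width 32 routed through 8 small expert MLPs.
print(ToyMoE(dim=32)(torch.randn(5, 32)).shape)   # torch.Size([5, 32])
```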
2. Post-training: taking the model from "knowing a lot" to "answering correctly and answering well"
Post-training refers to further fine-tuning the pre-trained model with task data and human-provided preference signals, so that it can understand instructions, perform tasks, and generate responses that meet human expectations, thereby improving its alignment, usefulness, and safety.
1) The problems it solves:
- Whether the model can understand specific task instructions (such as writing summaries or answering questions)
- Whether responses meet human preferences (concise, appropriate, relevant)
- Whether the output is stable, logical, and low in hallucinations
2) The output:
- An aligned model that can understand and execute human instructions
- A model with basic assistant capabilities, usable in dialogue systems, code collaboration, document processing, and other scenarios
What did DeepSeek-V3 post-training do?
For DeepSeek-V3, post-training mainly consists of two core steps: supervised fine-tuning and reinforcement learning. Although the two steps are conceptually simple, their implementation is full of sophisticated technical design and innovative ideas.
1. Supervised Fine-tuning (SFT): Understanding and executing human instructions
Supervised fine-tuning is the first key step in DeepSeek-V3's post-training phase. Its goal is to turn the pre-trained model into an assistant that can understand and execute human instructions.
The DeepSeek team built a curated dataset of 1.5 million cross-domain instruction instances, using different data-construction methods for different types of tasks:
1) Careful construction of reasoning data:
For data that requires deep thinking (such as math problems, programming challenges, and logic puzzles), the team did not manually construct question-answer pairs; instead, it adopted knowledge distillation.
DeepSeek first trains expert models (built by combining SFT and RL) for specific domains such as code, mathematics, or general reasoning. These expert models then serve as data generators, providing two types of SFT samples for the final model:
- The original question-answer pair
- A triple consisting of a prompt, the question, and an R1 answer
In this way, the final model can combine the reasoning depth of R1 with a clean output format.
2) Human-machine collaboration for everyday interaction data: for scenarios such as creative writing, role-playing, and simple question answering, the team used DeepSeek-V2.5 to generate initial responses, which were then reviewed and verified by human annotators to ensure accuracy and appropriateness. This human-machine collaboration improves data-creation efficiency while also safeguarding data quality.
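A common way to implement SFT on such instruction data is to apply the usual next-token loss while masking out the prompt tokens, so the model is trained only on the response. The sketch below assumes that setup; it is illustrative, not DeepSeek's actual pipeline, and the `prompt_len` field is a placeholder for however the data schema marks the prompt/response boundary.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """logits: [1, T, V]; input_ids: [1, T]; prompt_len: number of prompt tokens."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100   # ignore prompt tokens in the loss
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,          # only response tokens contribute to the gradient
    )

# Toy usage: a 12-token sequence whose first 5 tokens are the instruction/prompt.
T, V = 12, 1000
loss = sft_loss(torch.randn(1, T, V), torch.randint(0, V, (1, T)), prompt_len=5)
print(loss.item())
```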
2. Reinforcement learning: Optimizing answers to better match human preferences
SFT is just "imitation"; reinforcement learning is "preference optimization": it lets the model learn which kinds of answers are considered better and more reasonable.
1) Where does the reward come from?
DeepSeek-V3's reward design uses a dual-track system, providing targeted feedback according to the nature of the question:
Rule-based objective rewards: for questions with definite answers (such as math or programming problems), a rule-based verification mechanism is used:
- The model is required to give its final answer in a specified format, which is then checked against the rules
- For programming problems, a compiler runs the code against test cases to produce objective feedback
This approach provides a reliable evaluation criterion that is not easily manipulated.
Model-based flexible rewards: for open-ended questions or subjective tasks, a reward model trained from the DeepSeek-V3 SFT checkpoint is used for evaluation.
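The sketch below illustrates what the rule-based track might look like: a boxed-answer match for math and test-case execution for code. The \boxed{} convention, the `solve` entry point, and the absence of sandboxing are simplifying assumptions for illustration, not DeepSeek's actual verifier.

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """1.0 if the boxed final answer matches the reference exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0   # answer not given in the required format
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def code_reward(generated_code: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Fraction of test cases passed by a generated `solve` function (no sandboxing here)."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # a real system would execute this in a sandbox
        solve = namespace["solve"]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass   # a crashing test case simply earns no credit
    return passed / len(test_cases)

print(math_reward(r"The area is \boxed{12}.", "12"))                                   # 1.0
print(code_reward("def solve(a, b):\n    return a + b", [((1, 2), 3), ((5, 5), 10)]))  # 1.0
```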
2) Innovation in the optimization strategy: Group Relative Policy Optimization (GRPO):
- Generate multiple answers for each prompt
- Compare score differences within the group to construct the advantage function
- Optimize the policy so that answers move toward the preferred direction
This method is more stable and computationally efficient than traditional PPO, and is better suited to large-model training. (For a detailed explanation, please refer to the previous content.)
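The core of GRPO's group-relative scoring can be sketched in a few lines: normalize each sampled answer's reward against its group's mean and standard deviation to obtain the advantage. The clipped policy-gradient update and the KL penalty are omitted here for brevity.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [group_size] scores for answers sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: 4 sampled answers to one prompt, scored by a reward model or rule checker.
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
advantages = group_relative_advantages(rewards)
print(advantages)   # positive for above-average answers, negative for below-average ones
```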
3. As usual, let’s summarize
Through the training process of DeepSeek-V3, we can clearly see the two key stages of the growth of large models:
- Pre-training : Use ultra-large-scale data to build a language "base" to give the model general understanding and expression capabilities.
- Post-training: Through instruction data and human preference guidance, the model learns to understand tasks better, becoming more aligned with humans, more practical, and safer.