2025 · Everyone should know a little basic AI technology: how are ChatGPT and DeepSeek R1 trained?

An in-depth yet easy-to-understand guide to AI technology, helping you cross the technical threshold and keep up with the times.
Core content:
1. The necessity and challenges of popularizing AI technology
2. The pre-training and post-training process of large language model training
3. Case analysis of the training of ChatGPT, R1-Zero & DeepSeek R1
A few words up front.
A week ago, as DeepSeek amazed the world, I wrote a study note on distillation technology from the perspective of a liberal-arts novice, "DeepSeek is taking the world by storm, but 90% of people don't know what the distillation technology it mentions is." I didn't expect more than 40,000 people to read it. It also made me realize that AI has become a household buzzword, yet there are still many people like me who have a strong desire to learn but are blocked by technical barriers.
Of course, there are plenty of great AI learning materials on the Internet, but they can easily feel daunting to people without industry experience or a technical background. Take me as an example: it took half a year of deliberately studying AI-related knowledge before I could gradually make sense of it in my own words, the way I do now.
Right now, every one of us is caught in the torrent of the AI era. To avoid being swallowed by the tide of the times, mastering some basic AI knowledge has become a necessity. So I have decided to keep updating this kind of content, hoping to build an easy-to-understand bridge for friends who have lost their way in the maze of AI jargon. After all, understanding is the best way to break through the fog of the unknown.
Of course, these are still just my personal study notes, and there may be mistakes where my understanding falls short. If any experts spot errors, please correct me so that I don't lead others astray.
⚪️
This time, let's first look at how a large model is trained, including:
- The main training process of the large language model
- The training processes of several top models:
- ChatGPT
- R1-Zero & DeepSeek R1
(Figure: the training process of a large model, in one picture)
Large language model training is divided into two key stages: pre-training and post-training.
In the pre-training stage, the model is like an adventurer exploring a treasure house of knowledge on its own. Through self-supervised learning, it mines general knowledge from massive data, such as language patterns and image features. This process requires no manual annotation and relies entirely on the model teaching itself. The output is a base model such as BERT or GPT, which has a large store of knowledge but still needs further polishing.
The post-training stage determines whether the model is actually usable, and it roughly breaks down into four parts:
- Fine-tuning: adapting the model to specific tasks, the way a graduate needs to learn job skills before starting work. This includes full-parameter fine-tuning (adjusting all parameters), efficient fine-tuning (adjusting only some parameters to save resources; see the first sketch after this list) and domain transfer (moving from a general domain to a specific one).
- Alignment: making the model consistent with human values. Methods such as RLHF and DPO train the model on human feedback so that its output meets human expectations; this is why ChatGPT can converse with people smoothly.
- Deployment optimization: improving efficiency and reducing cost. Model compression shrinks the model's size and parameter count; knowledge distillation transfers a large model's knowledge into a small one (see the second sketch after this list).
- Continuous learning: letting the model learn new knowledge without forgetting old knowledge. Elastic Weight Consolidation protects important parameters, and a replay buffer stores old data for review.
Through these processes, various practical models are born, such as task-specific, aligned, lightweight and dynamically updated models, to meet the needs of different scenarios.
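To make the "efficient fine-tuning" idea above concrete, here is a minimal LoRA-style sketch in PyTorch: the pre-trained weight matrix stays frozen and only a tiny low-rank add-on is trained. The layer size and rank are made-up toy values, not anyone's production configuration.

```python
# A minimal LoRA-style sketch of "efficient fine-tuning": freeze the original weights
# and train only a small low-rank adapter. All sizes here are toy values.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # the pre-trained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # output = frozen base layer + trainable low-rank correction
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} / {total}")   # only a tiny fraction is trained
```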
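And here is a minimal sketch of the knowledge distillation mentioned under deployment optimization: a small "student" model learns to imitate the softened output distribution of a large "teacher". The two toy linear layers and the temperature value are placeholders purely for illustration.

```python
# Knowledge distillation sketch: train the student to match the teacher's softened outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(64, 10)      # stand-in for a large pre-trained model
student = nn.Linear(64, 10)      # stand-in for the smaller model we want to train
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
temperature = 2.0

x = torch.randn(16, 64)                              # a toy batch of inputs
with torch.no_grad():
    teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)

student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)

# KL divergence between the student's and the teacher's softened predictions.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
loss.backward()
optimizer.step()
```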
From this perspective, training a large language model isn’t that complicated, right?
⚪️
Now that we have understood the general training process, let’s take ChatGPT and DeepSeek as examples to explore their training processes.
ChatGPT training process
(A note: although post-training takes up a larger share of the picture here, pre-training actually takes far more training time than post-training. Don't let the proportions in the picture mislead you.)
It takes three steps to put an elephant in a refrigerator; training ChatGPT takes only one step more.
Step 1: Pre-training - train a base model
This step looks the simplest but takes the longest: almost 99% of the time is spent in this stage. The model is fed a large amount of text and learns to predict the next token given the tokens it has already seen. At this point, however, the model does not yet have good conversational ability.
For example, when you ask the model "What is the capital of France?", it may answer: "The capital of France is a city" or "What is the capital of China?" Neither is the expected answer.
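To see what "predict the next token" means in code, here is a minimal sketch of the pre-training objective in PyTorch. The tiny embedding-plus-linear "model" and the random token IDs are stand-ins I made up; real pre-training uses a Transformer and enormous text corpora.

```python
# A toy version of the pre-training objective: given the tokens seen so far, predict the next one.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (1, 64))     # a toy "document" of 64 token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one: each position predicts the next token

for step in range(100):
    logits = model(inputs)                                          # (1, 63, vocab_size)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```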
Step 2: Perform supervised fine-tuning on the base model from step 1 - train an SFT model
The training method is the same as in step 1; only the data set has changed. The data set now consists of human-written Q&A pairs. After this step, the model has much better conversational ability.
For example, when you ask the model "What is the capital of France?", it may answer: "The capital of France is Paris." or "Paris is the capital of France."
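The point that "only the data set has changed" can be shown with a small sketch: the loss is still next-token prediction, but the text is now a human-written Q&A pair, and it is common practice to score only the answer tokens. The word-level "tokenizer" and tiny model below are invented purely for illustration.

```python
# Supervised fine-tuning sketch: same next-token loss, but on a human-written Q&A pair,
# with the loss masked so only the answer part is scored.
import torch
import torch.nn as nn

vocab = {w: i for i, w in enumerate("What is the capital of France ? Paris .".split())}
encode = lambda text: [vocab[w] for w in text.split()]

prompt = encode("What is the capital of France ?")
answer = encode("Paris .")
tokens = torch.tensor([prompt + answer])

labels = tokens.clone()
labels[:, :len(prompt)] = -100                     # -100 = ignored: don't score the question itself

model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
logits = model(tokens[:, :-1])                     # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), labels[:, 1:].reshape(-1), ignore_index=-100)
loss.backward()                                    # a real run would repeat this over many Q&A pairs
```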
Step 3: Train a reward model
Use the SFT model trained in step 2 to perform a continuation task, letting the model output multiple continuations for the same prompt.
For example: "A: The capital of France is Paris. Paris is a beautiful city." and "B: Paris is the capital of France. Paris is also the political, economic and cultural center of France."
Human annotators rate the quality of these continuations. Training on these ratings yields a reward model that can judge whether a continuation matches human preferences and assigns higher rewards to better answers.
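One common way to train such a reward model is a pairwise ranking loss: for two continuations of the same prompt, the one humans preferred should receive the higher score. A minimal sketch, assuming the reward model is simply something that maps a response representation to a single number; the random feature vectors stand in for real (prompt, response) encodings.

```python
# Reward-model sketch: learn a scalar score such that the human-preferred ("chosen")
# continuation scores higher than the rejected one.
import torch
import torch.nn as nn

reward_model = nn.Linear(128, 1)                     # maps a response representation to a score
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

chosen = torch.randn(8, 128)                         # batch of preferred continuations
rejected = torch.randn(8, 128)                       # batch of dispreferred continuations

score_chosen = reward_model(chosen)
score_rejected = reward_model(rejected)

# Bradley-Terry style pairwise loss: push the chosen score above the rejected one.
loss = -nn.functional.logsigmoid(score_chosen - score_rejected).mean()
loss.backward()
optimizer.step()
```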
Step 4: A continuous reinforcement-learning loop
Use the SFT model from step 2 to answer questions, use the reward model from step 3 to score those answers, and use the scores for reinforcement learning. Repeating steps 2-4 eventually yields a ChatGPT model with strong continuation capabilities.
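The loop can be pictured with a deliberately toy REINFORCE-style sketch: the "policy" picks an answer, a stand-in reward table plays the role of the reward model, and high-scoring answers become more likely. Real RLHF training (e.g. PPO with a KL penalty toward the SFT model) is far more involved; every number here is invented.

```python
# Toy RLHF loop: sample an answer, score it, reinforce high-scoring answers.
import torch
import torch.nn as nn

num_answers = 4
policy_logits = nn.Parameter(torch.zeros(num_answers))    # stand-in for the SFT model (the policy)
reward_table = torch.tensor([0.1, 0.2, 1.0, 0.3])          # stand-in for the reward model's scores
optimizer = torch.optim.Adam([policy_logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=policy_logits)
    answer = dist.sample()                   # the policy "answers the question"
    reward = reward_table[answer]            # the reward model "scores the answer"
    loss = -reward * dist.log_prob(answer)   # REINFORCE: make high-reward answers more likely
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(policy_logits.softmax(dim=-1))         # most probability should now sit on answer index 2
```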
Training process of DeepSeek R1 & R1-Zero
Let's talk about R1-Zero first
R1-Zero is distinctive in that it skips the SFT step after pre-training and goes straight to reinforcement learning, and its reasoning ability emerged through reinforcement learning alone. However, because R1-Zero has not gone through fine-tuning and alignment, its output may not match what humans ideally expect.
For example, if you ask the model a question in Chinese, such as "What is the capital of France?", it may answer you in English.
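For reasoning tasks with checkable answers, this kind of reinforcement learning can use simple rule-based rewards rather than a learned reward model; the DeepSeek-R1 report describes an accuracy reward plus a format reward that asks the model to think inside special tags before answering. A rough sketch, with the tags and scoring rules simplified for illustration:

```python
# Rule-based reward sketch: one reward for a correct answer, one for keeping the required
# format (thinking first, then answering). Exact tags and weights are illustrative only.
import re

def format_reward(output: str) -> float:
    # Reward outputs that show their reasoning inside <think>...</think> before answering.
    return 1.0 if re.search(r"<think>.+?</think>", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, reference_answer: str) -> float:
    # For verifiable tasks (math, code), the final answer can be checked mechanically.
    final = output.split("</think>")[-1].strip()
    return 1.0 if reference_answer in final else 0.0

sample = "<think>2 + 2 means adding two and two.</think> The answer is 4."
print(format_reward(sample), accuracy_reward(sample, "4"))   # 1.0 1.0
```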
DeepSeek R1
In a nutshell: after pre-training, the R1 model was fine-tuned on a batch of cold-start data and then underwent reinforcement learning.
1. Pre-training (not expanded)
2. Cold-start fine-tuning on the pre-trained model (which here should be the DeepSeek-V3 base model): fine-tune it with a small amount of carefully selected long chain-of-thought cold-start data. These data act as guiding examples, helping the model initially learn important thinking and answering patterns, such as how to search for a solution and correct its own errors, while also making its language more standardized and less confused.
The cold-start data were compiled by a group of experts, mainly in mathematics and other science and engineering fields, and consist of carefully designed long chain-of-thought data. Simply put, for a math problem, the cold-start data contain not only the answer but, more importantly, the step-by-step solution.
3. Reinforcement learning on reasoning tasks that have precise answers, such as writing code, mathematics and logical reasoning, using the GRPO policy-gradient reinforcement learning algorithm (a sketch of its core idea follows this list). During training the model keeps trying to generate answers; if an answer is correct and its format meets the requirements, for example showing the thinking process before giving the final answer, the model is rewarded, otherwise it is guided to improve.
4. Distillation. Data generated by the model that performs well on reasoning tasks is mixed with general-domain data (such as chatting and role-play) drawn from DeepSeek V3, and this mixed data is used to fine-tune DeepSeek V3 again, so that the model's abilities in different areas are more balanced and it can handle both professional reasoning tasks and everyday communication.
5. Another round of comprehensive reinforcement learning is performed on the model obtained from the steps above to further improve its performance across domains, finally yielding the DeepSeek R1 model.
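As a small illustration of the GRPO algorithm mentioned in step 3: for each question a group of answers is sampled, each answer is scored, and each answer's advantage is its reward relative to the group average, so no separate value model is needed. The rewards below are made up; in practice they would come from rule-based checks like the ones sketched earlier.

```python
# Minimal sketch of GRPO's group-relative advantage. Not the full algorithm, just its core idea.
import torch

group_rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # 8 sampled answers, scored

# Group-relative advantage: how much better or worse each answer is than its siblings.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)

# Each answer's token log-probabilities would then be scaled by its advantage (together with a
# PPO-style clipped ratio and a KL penalty toward a reference model) to update the policy.
```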