A detailed explanation of the entire process of large model training

Written by
Jasper Cole
Updated on: July 16, 2025
Recommendation

In-depth analysis of each key step of large language model training.

Core content:
1. The core ideas and steps of the pre-training stage
2. The role of the Transformer architecture in pre-training
3. The implementation methods of instruction fine-tuning, reward model and reinforcement learning



This article details the entire process of training a large language model from scratch, including the pre-training stage, instruction fine-tuning, the reward model, and the reinforcement learning implementation.


1. Pretraining

Pre-training of large models is a core step in modern natural language processing (NLP), especially for models based on the Transformer architecture (such as the GPT series and BERT). The goal of pre-training is for the model to learn the statistical regularities, grammatical structure, and semantic relationships of language from a large amount of unsupervised text, so that it can later be transferred to specific downstream tasks (such as text classification, question answering, or translation). The process is like a high school student who spends three years receiving systematic instruction, accumulating a broad base of knowledge and skills, and thereby preparing for later specialized training (such as the college entrance examination).


1.1 Core Idea

Pre-training of large models usually falls into two categories: self-supervised learning and unsupervised learning. The most common is self-supervised learning, in which the model learns grammatical and semantic information from large amounts of unlabeled text without manually annotated data.

The pre-training process includes the following key steps:

  • Goal setting : Learn the internal structure of the language, the relationships between words, and long-range contextual dependencies.
  • Data preparation : Use large-scale unlabeled text datasets, which usually come from the Internet, books, news, etc.
  • Training goal : The model trains itself by predicting certain parts of the text (such as the next token, or the masked part) to learn the laws of language.


1.2 Transformer Architecture Overview

The Transformer architecture is the core foundation of modern large-scale pre-trained models (such as GPT and BERT). It is mainly composed of the attention mechanism and feed-forward neural networks.

  1. Self-attention mechanism : The self-attention mechanism allows the model to "pay attention" to tokens at other positions in the input sequence when processing each token, thereby capturing long-distance dependencies (a minimal sketch follows this list).
  2. Multi-head Attention  : By combining multiple attention heads, the model can capture a variety of different dependencies in different subspaces.
  3. Feedforward Neural Network : Each Transformer layer includes a feedforward neural network to further process the information from the attention layer.
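
To make the self-attention step above concrete, here is a minimal PyTorch sketch of scaled dot-product self-attention for a single head. The module name, dimensions, and toy input are illustrative assumptions, not taken from any particular model.

import math
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    # Single-head scaled dot-product self-attention (illustrative only)
    def __init__(self, d_model: int = 64):
        super().__init__()
        # Linear maps that produce queries, keys, and values from the same input
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise attention scores
        weights = scores.softmax(dim=-1)   # how much each token "attends" to every other token
        return weights @ v                 # context-aware representation of each token

x = torch.randn(1, 10, 64)                 # a toy batch of 10 token embeddings
out = SimpleSelfAttention()(x)             # out.shape == (1, 10, 64)

Multi-head attention repeats this computation in several lower-dimensional subspaces in parallel and concatenates the results; in PyTorch this is available as nn.MultiheadAttention.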

Further reading: "NLP Basic Knowledge Base | 3 Transformer (Part 2)"


1.3 Specific process

The pre-training process for large models usually follows these steps:


(1) Data preparation

The pre-training process uses large-scale text datasets (such as Wikipedia, BooksCorpus, and Common Crawl), which are usually unlabeled raw text. This raw text is typically processed as follows:

  • Tokenization : The original text is tokenized, that is, split into tokens. Most modern pre-trained models use sub-word tokenization methods (such as Byte Pair Encoding (BPE) or SentencePiece), so the vocabulary can flexibly handle the out-of-vocabulary (OOV) problem.
  • Embedding : Each token is mapped to a high-dimensional vector representation through an embedding layer and fed as input to the rest of the model (a brief sketch of both steps follows this list).
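
As a brief sketch of these two steps, the snippet below tokenizes a sentence with the GPT-2 BPE tokenizer from the Hugging Face transformers library and maps the resulting token ids to vectors with a randomly initialized embedding layer. The choice of tokenizer and the embedding size of 768 are assumptions for illustration.

import torch
import torch.nn as nn
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # a BPE sub-word tokenizer
token_ids = tokenizer.encode("The quick brown fox jumps")    # text -> sub-word token ids
print(tokenizer.convert_ids_to_tokens(token_ids))            # inspect the sub-word pieces

embedding = nn.Embedding(tokenizer.vocab_size, 768)          # 768 is an illustrative dimension
vectors = embedding(torch.tensor(token_ids))                 # [seq_len, 768], the model's input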


(2) Training goal setting

The choice of pre-training tasks will vary depending on the model architecture. Common goals include:

Autoregressive Language Modeling : This method is mainly used for generative pre-training (such as GPT). The model learns the rules of language by predicting the next token in a text sequence; in an autoregressive model, each token is predicted from the tokens that precede it.

Specifically, given a text sequence $x_1, x_2, \ldots, x_T$, the task of the model is to learn the conditional probability distribution

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

That is, at each step the model predicts the next token based on all of the preceding tokens, updating its prediction as the context grows.
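
A small sketch of how this plays out at generation time: the model repeatedly computes $P(x_t \mid x_1, \ldots, x_{t-1})$ and appends its prediction to the sequence. Here model is a hypothetical callable that maps token ids of shape [1, t] to logits of shape [1, t, vocab_size]; greedy selection is used only to keep the example short.

import torch

def greedy_generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 20) -> torch.Tensor:
    ids = prompt_ids                                   # [1, prompt_len]
    for _ in range(max_new_tokens):
        logits = model(ids)                            # [1, seq_len, vocab_size]
        next_dist = logits[:, -1, :].softmax(dim=-1)   # P(x_t | x_1, ..., x_{t-1})
        next_id = next_dist.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)         # append the prediction and repeat
    return ids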

Autoencoding Language Modeling  : This method is mainly used for BERT-like models, which is trained by predicting masked tokens in the text. That is, part of the tokens in the input sequence are "masked" (usually replaced by a special [MASK] tag), and the goal of the model is to predict these masked tokens based on the context.

For example, given the sentence:


The quick brown fox jumps over the [MASK] dog. 

The goal of the model is to predict that the word at the [MASK] position is "lazy".

This training method enables the model to capture bidirectional contextual information (i.e., considering both the left and right contexts at the same time), which is suitable for tasks that require understanding semantic relationships.
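
The sketch below shows one way such a masked training example might be constructed: a fraction of the token ids is replaced with the [MASK] id, and only those positions contribute to the loss. The 15% masking rate follows the usual BERT recipe, but the function and its arguments are assumptions for illustration.

import random
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)        # -100 = "ignore this position in the loss"
    for i in range(len(token_ids)):
        if random.random() < mask_prob:
            labels[i] = token_ids[i]        # the model must recover the original token here
            input_ids[i] = mask_token_id    # replace it with [MASK] in the input
    return torch.tensor(input_ids), torch.tensor(labels)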


(3) Model training

The model predicts the target value through the forward propagation algorithm, then calculates the loss value between the predicted value and the true value through the loss function, and then optimizes the model parameters using the gradient descent algorithm and the back propagation algorithm. The most common optimization algorithm is Adam, which is very effective when dealing with sparse gradients and large-scale data.

Different language modeling methods have different loss functions:

  • Autoregressive language modeling : For autoregressive models (such as GPT), the loss function is usually the cross-entropy loss, whose goal is to minimize the difference between the probability distribution predicted by the model and the true distribution; that is, at each position the model tries to make its predicted token distribution close to the true one. The loss function is $\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$.
  • Autoencoding language modeling : For autoencoding models (such as BERT), the loss function is also the cross-entropy loss, and the goal is for the model to predict the correct masked tokens. For each masked position, the loss is the cross entropy between the token distribution predicted by the model and the true token.

Through such large-scale pre-training, the model can capture the rich structure and semantic information in the language, thus laying a solid foundation for transfer learning of downstream tasks. For example, in the autoregressive model, the model captures the dependency between words by predicting the next word based on the previous context, which helps the model learn the fluency and consistency of the language.
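
Putting the pieces together, here is a minimal sketch of one pre-training step for an autoregressive model with cross-entropy loss and Adam. The tiny embedding-plus-linear "model" is only a stand-in to keep the example runnable; a real setup would use a full Transformer and a data loader.

import torch
import torch.nn.functional as F

vocab_size = 50257                                    # illustrative vocabulary size
model = torch.nn.Sequential(                          # stand-in for a real Transformer
    torch.nn.Embedding(vocab_size, 256),
    torch.nn.Linear(256, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

def train_step(batch_ids: torch.Tensor) -> float:
    inputs = batch_ids[:, :-1]                        # x_1 ... x_{T-1}
    targets = batch_ids[:, 1:]                        # shifted by one: predict x_2 ... x_T
    logits = model(inputs)                            # [B, T-1, V]
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # backpropagation
    optimizer.step()                                  # Adam parameter update
    return loss.item()

loss = train_step(torch.randint(0, vocab_size, (4, 128)))   # a random toy batch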


2. Supervised Finetuning

Although the model has learned the statistical characteristics of the language and some basic grammatical and semantic information through a large amount of text data during the pre-training stage, it usually cannot handle specific tasks well. Through the training of the supervised fine-tuning (SFT) stage, the model will learn how to perform better in specific tasks or fields , especially to be more accurate and efficient when dealing with specific problems. It is like the special mock test training that high school students do for the college entrance examination.

The method generally used at this stage is instruction fine-tuning; for simplicity, SFT and instruction fine-tuning are used interchangeably below.


2.1 Data Preparation

Data mainly comes from two sources: manual annotation, and automatic generation of training data with models such as ChatGPT. The latter reduces the cost of manually constructing datasets and can produce a large number of training examples quickly. Specifically, a few seed instruction examples can be given to the model, which then generates similar new instruction-answer pairs, forming an automated data-generation pipeline. For example, Stanford's Alpaca project generated roughly 52,000 instruction-answer examples with an OpenAI model (text-davinci-003), greatly improving the efficiency of building the training set.


(1) Text data format



    "Instruction" : "" ,
    "Input" : "" ,   //The Input field is optional, and sometimes the Instruction part will contain the Input content
    "Output" : ""
}

// Example:
{
    "Instruction": "Please help me translate a sentence",
    "Input": "hello",
    "Output": "Hello"
},
{
    "Instruction": "Please help me translate a sentence: hello",
    "Input": "",
    "Output": "Hello"
}


(2) Data encoding format

First, the same tokenizer as used in pre-training (such as BPE or SentencePiece) is applied, splitting the text into tokens. Then the input and output are combined into a single sequence, usually in the following form:


[Instruction] + [Separator] + ([Input]) + [Separator] + [Output]

Here are some examples:

text:


User: Please explain the basic concept of quantum mechanics. \nAssistant: Quantum mechanics is a branch of physics that describes the behavior of microscopic particles...

Token sequence:


[Token_1, Token_2, ..., Token_n]

Label sequence (corresponding output):


[Token_k, Token_k+1, ..., Token_m]

Notice:

  • The maximum length of the input is limited by the context window size of the model (e.g. 2048 tokens for GPT-3).
  • If the sequence exceeds the maximum length, truncation or sliding window processing may be required.
  • Modern deep learning frameworks (such as PyTorch and TensorFlow) require all samples in a batch to have the same shape (i.e., tensors of the same dimensions) so that the parallel computing capability of GPUs can be fully utilized. Padding and truncation are therefore used to make all sequences in a batch the same length (a brief sketch follows this list).
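
A minimal sketch of this preparation step, using the GPT-2 tokenizer from the Hugging Face transformers library: the instruction, optional input, and output are concatenated into one sequence and then padded or truncated to a fixed length. The separator (a newline), the maximum length of 512, and reusing the end-of-sequence token for padding are all assumptions for illustration.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default

def build_example(instruction: str, inp: str, output: str, max_len: int = 512):
    prompt = instruction + ("\n" + inp if inp else "")
    text = prompt + "\n" + output                     # [Instruction] (+ [Input]) + [Output]
    enc = tokenizer(text, truncation=True, max_length=max_len,
                    padding="max_length", return_tensors="pt")
    return enc["input_ids"], enc["attention_mask"]    # both of shape [1, max_len]

ids, mask = build_example("Please explain the basic concept of quantum mechanics.", "",
                          "Quantum mechanics is a branch of physics...")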


2.2 Training Objectives and Loss Calculation

The goal of SFT is to make the output generated by the model as close as possible to the labeled correct answer (label). Unlike predicting the next word in the pre-training phase, SFT needs to predict the entire output sequence.

SFT data contains input (context) and output (label), and the model needs to learn to generate the entire label sequence. The model optimizes the prediction accuracy by calculating the cross entropy between the predicted token distribution and the true label distribution .


Specific process of loss

  1. Forward propagation : After the input sequence and the output sequence are concatenated, they are fed into the model. The model calculates the probability distribution of tokens at each position through the Transformer layer.

  2. Calculate probability distribution : The model's final layer produces a logits matrix of shape [L, V], where L is the sequence length and V is the vocabulary size. Applying softmax to the logits gives the token probability distribution at each position.

  3. Calculate the cross-entropy loss : For each token in the output sequence, compute the cross entropy between the predicted probability and the true label, and average over the output tokens to obtain the overall loss (a sketch follows this list):

    $\mathcal{L} = -\frac{1}{m} \sum_{j=1}^{m} \log P(y_j \mid x, y_1, \ldots, y_{j-1})$

    where x is the input sequence, y_j is the j-th token in the output sequence, and m is the length of the output.
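
A small sketch of this loss computation, assuming PyTorch: positions that belong to the prompt are masked out with the label value -100, which F.cross_entropy ignores, so only the output tokens contribute to the loss.

import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # logits: [L, V] from the model; input_ids: [L] = prompt tokens followed by output tokens
    labels = input_ids.clone()
    labels[:prompt_len] = -100                    # do not penalize predictions of the prompt itself
    # shift so that the logits at position t are scored against the token at position t+1
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)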

Through SFT, the model is further adjusted on the data of specific tasks, and can better understand and complete the actual task requirements, such as dialogue generation, question answering, machine translation, etc. The loss calculation method continues the token-by-token cross entropy in the pre-training stage, but adds the supervision information of the input-output pair, so that the generation results of the model are more in line with the task requirements.


3. Reward Model

The Reward Model (RM) is a crucial part of the reinforcement learning with human feedback (RLHF) process. Its role is to evaluate the quality of the text output by the large language model, give a score, and guide the model to better meet human preferences and needs in the subsequent generation process.

By interacting with human annotators, the reward model can provide feedback signals to help optimize the model's output, making the generated content more natural, authentic, and in line with user expectations, just like a senior high school teacher who specifically studies the college entrance examination questions of previous years to help students improve their grades .


3.1 Why do we need a reward model?

In the instruction fine-tuning (SFT) stage, although the model has been trained and has a certain language generation capability, its output may still not conform to human preferences. There may be "hallucination" problems (the content generated by the model is unreal or inaccurate) or "harmfulness" problems (outputting harmful, inappropriate or disturbing content).

This is because SFT only fine-tunes the pre-trained model with limited manually annotated data, and may not fully correct the potential erroneous knowledge or inappropriate output in the pre-training stage. In order to further improve the generation quality of the model and solve these problems, a reward model must be introduced to use reinforcement learning for further optimization.


3.2 Reinforcement Learning and Reward Model

The core idea of ​​reinforcement learning is to guide the learning of the model through a reward and punishment mechanism.  In RLHF (Reinforcement Learning with Human Feedback), the reward model is responsible for providing a reward score for each response generated by the model, helping the model learn which outputs meet human expectations and which do not.

The training data for the reward model usually comes from manually annotated ranking data , where annotators rank multiple generated answers and the reward model is trained based on these rankings.

Unlike traditional supervised learning, the reward model does not require a clear score to be given to each output directly, but compares multiple outputs through relative ranking to tell the model which outputs are better and which are worse. This relative ranking method can effectively reduce subjective differences in manual scoring, improve the consistency of annotations and the learning efficiency of the model.


3.3 Training Reward Model


(1) Training data (manually sorted data)

The training data for the reward model is usually generated by human annotators who sort the model outputs. During the training process, the annotators will sort the quality of multiple generated answers instead of scoring each answer. Specifically, given a question, the annotators will evaluate and sort the multiple answers to the question and use these sorted data as the training data for the reward model.

This relative ranking method is more efficient and consistent than directly scoring each answer, because the scoring will be affected by the individual subjective opinions of the annotators, and relative ranking reduces this influence, making the annotation results of multiple annotators more consistent.

Data Format:


// Comparison-based data format
{
    "input": "text entered by the user",
    "choices": [
        { "text": "Candidate output 1", "rank": 1 },
        { "text": "Candidate output 2", "rank": 2 }
    ]
}

// Rating-based data format
{
    "input": "text entered by the user",
    "output": "output text generated by the model",
    "score": 4.5
}

The inputs to the reward model include:

  • Input text : The prompt or question given by the user as context.
  • Output text : The candidate answers generated by the model, whose quality is to be evaluated.
  • Context and candidate text concatenation : The reward model usually concatenates the input (context) and each choice (candidate text) and then feeds them into the model. In this way, the model can understand the relationship between the generated text and the context and evaluate the quality of the generated text based on this relationship.

Here is a simple example:


// Original data
{
    "input": "What is the capital of France?",
    "choices": [
        { "text": "The capital of France is Paris.", "rank": 1 },
        { "text": "The capital of France is Berlin.", "rank": 3 },
        { "text": "Paris is the capital of France.", "rank": 2 }
    ]
}
// Data actually fed to the model
[Input] What is the capital of France? [SEP] The capital of France is Paris.
[Input] What is the capital of France? [SEP] The capital of France is Berlin.
[Input] What is the capital of France? [SEP] Paris is the capital of France.


(2) Context Modeling

The reward model encodes the entire concatenated text using a Transformer-based architecture (such as BERT or RoBERTa). For each candidate text, the model produces a context-aware representation that captures the semantic relationship between the input and the candidate choice (a minimal sketch follows).
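
A toy version of such a scorer, assuming PyTorch: a small Transformer encoder reads the concatenated "input [SEP] candidate" token ids, and a linear head turns the representation into a single scalar score. All sizes and the choice of pooling position are illustrative assumptions.

import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size: int = 50000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(d_model, 1)              # one scalar score per sequence

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))              # [B, L, d_model]
        return self.score_head(h[:, -1, :]).squeeze(-1)      # score read from the last position

scores = TinyRewardModel()(torch.randint(0, 50000, (3, 32)))  # one score for each of 3 candidates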


(3) Calculate scores or rankings

  • Regression tasks : If the task is of regression type (e.g. predicting a score), the reward model generates a predicted quality score for each candidate text.
  • Ranking tasks : If the task is based on ranking (e.g. selecting which candidate text is of better quality), the reward model scores all candidate texts and compares the scores, so that higher-quality texts receive higher scores than lower-quality ones.


(4) Loss function

During training, the model calculates the loss by comparing the difference between the predicted score of the candidate text and the actual label (rank or score), and performs backpropagation optimization:

  • The regression task uses the mean squared error (MSE) loss to minimize the gap between the predicted score and the true score.
  • Ranking tasks usually use a contrastive or ranking loss (such as a pairwise or hinge loss) to optimize the model so that candidate texts are ranked correctly (a sketch of a pairwise ranking loss follows this list).
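
For the ranking case, a common choice (used in InstructGPT-style reward-model training, and an assumption here) is a pairwise loss that pushes the score of the higher-ranked answer above the score of the lower-ranked one:

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_better: torch.Tensor, score_worse: torch.Tensor) -> torch.Tensor:
    # score_*: rewards assigned to the higher- and lower-ranked answers for the same prompt
    # The loss is small when score_better exceeds score_worse by a clear margin.
    return -F.logsigmoid(score_better - score_worse).mean()

loss = pairwise_ranking_loss(torch.tensor([1.8]), torch.tensor([0.3]))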


3.4 Challenges of Reward Model

There are certain challenges in the design and training of reward models, which are mainly reflected in the following aspects:

  1. Diversity of human preferences : Different annotators may have different opinions on the same generated result, which requires the reward model to tolerate a certain degree of subjectivity and reduce bias through ranking learning.
  2. Model instability : Since the reward model is usually small, instability may occur during training. In order to improve the stability of training, the reward model usually adopts appropriate regularization techniques and optimization methods.
  3. Data quality and diversity : To ensure the effectiveness of the reward model, the training data needs to be diverse enough to cover different types of questions and answers. If the data quality is not high or too single, the model may not be able to learn effective scoring rules.


4. Reinforcement Learning with Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a method that combines reinforcement learning with human feedback, aiming to optimize the behavior and output of the model to make it more consistent with human expectations. By introducing human feedback as a reward signal to guide the model to better understand and satisfy human preferences, it can generate more natural outputs that are more in line with human intentions.

Just as college entrance exam candidates adjust and optimize their answering strategies based on feedback from mock exams, the model in RLHF continuously optimizes its own behavior based on human feedback.


4.1 Core Components of the RLHF Framework

The RLHF framework consists of several key elements that work together to ensure that the model can be optimized based on human feedback:

Reinforcement Learning Algorithm (RL Algorithm) : The reinforcement learning algorithm is responsible for training the model to optimize its behavior. In RLHF, the commonly used algorithm is Proximal Policy Optimization (PPO). PPO is an "on-policy" algorithm: the model learns from data generated by its current policy rather than from stored past experience. Through PPO, the model adjusts its policy according to the reward signal and ultimately generates the desired outputs.
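
The heart of PPO is its clipped surrogate objective, which keeps each update close to the current policy. A minimal sketch follows; the tensor names and the clipping value of 0.2 are illustrative assumptions.

import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # take the more pessimistic of the clipped and unclipped objectives, then minimize its negation
    return -torch.min(ratio * advantages, clipped * advantages).mean()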

Action : In the RLHF framework, an action is the output text generated by the model for a given prompt. Each output can be regarded as a choice the model makes when performing the task. The action space consists of all token sequences that can be formed from the vocabulary.

Environment : The environment is the scene where the model interacts with the outside world, providing the state, actions, and corresponding rewards that the model needs to perform tasks. In RLHF, the environment is the external world where the model generates outputs based on prompts and adjusts its behavior based on feedback.

State Space : All possible states presented to the model by the environment, usually prompts or contextual information input to the model.

Action Space : All possible actions that the model can perform, that is, all output text generated based on the prompt.

Reward Function : Depending on the output of the model, the reward function assigns a reward or penalty to it. Typically, these rewards are predicted by a trained reward model that evaluates the quality of the output based on human feedback.

Observation : An observation is an input prompt that the model receives when generating output. These prompts serve as the basis for the model to make decisions and perform tasks. The observation space refers to the possible input token sequences, i.e., the prompt text that the model processes.

Reward : The reward mechanism is a core component of the RLHF framework, responsible for assigning rewards or penalties based on the predictions of the reward model. The reward model is usually trained with a large amount of human feedback data to ensure that it can accurately predict human preferences for different outputs. Feedback data is usually collected by sorting and scoring the model output.


4.2 RLHF Practical Application: InstructGPT Training Process

The practical application of RLHF can be illustrated by the training process of InstructGPT (the predecessor of ChatGPT). The training process of InstructGPT is divided into three stages:

First, prompts are sampled from the prompt dataset, and annotators write answers for the sampled prompts to form demonstration data. This data is used to fine-tune the GPT-3 model in a supervised manner (Supervised Fine-Tuning, SFT), so that the model learns to generate answers that meet basic requirements.

Next, prompts are sampled again and the model generates multiple outputs for each; annotators score or rank these outputs to form comparison data, which is then used to train the reward model (RM). The reward model learns to predict preference scores for different outputs, helping to steer the model toward higher-quality generations.

Finally, the PPO (Proximal Policy Optimization) algorithm is used to optimize the policy model against the reward model. Prompts are sampled from the dataset, the model (initialized from the SFT model obtained in the supervised learning phase) generates outputs, and the reward model scores each output. PPO then adjusts the model's policy so that it produces outputs more in line with human expectations. Through RLHF, the model gradually improves its performance using human feedback, ultimately yielding a model that can generate high-quality outputs.

Through these three stages of training, InstructGPT is able to generate outputs that are more in line with human needs and preferences, and ultimately form a conversational model similar to ChatGPT.


Conclusion

Training a large language model from scratch is a complex and challenging process that involves the design and optimization of many stages. Through language modeling in the pre-training phase, instruction fine-tuning, reward-model construction, and reinforcement learning with human feedback (RLHF), a large language model that is efficient, flexible, and aligned with human needs can eventually be trained. The optimization of each step is crucial; only with careful design and repeated experimentation can the desired results be achieved.