Understanding Pre-training of Large Models in One Article
Updated on: June 19, 2025
Recommendation
Explore the core mechanisms of large-model pre-training and master the pre-training techniques behind BERT and GPT.
Core content:
1. The goal of pre-training and the importance of self-supervised learning
2. Detailed explanation of MLM and NSP tasks of BERT pre-training
3. Phased learning suggestions and resources for beginners studying pre-trained models
Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)
Today, let's talk about the pre-training of BERT and GPT to understand the fourth step of building a large model: pre-training.

Pre-training is the first stage of training large language models (such as BERT and GPT). Its core goal is to learn general language representations from massive amounts of unlabeled text through self-supervised learning, so that the model masters basic capabilities such as grammar, semantics, and common sense, laying the foundation for subsequent fine-tuning.

BERT pre-training: MLM and NSP

Built on the bidirectional architecture of the Transformer encoder, BERT learns contextual semantics through the masked language modeling (MLM) and next sentence prediction (NSP) tasks. MLM randomly masks 15% of the input tokens and forces the model to predict the missing words from the bidirectional context, breaking through the limitation of traditional unidirectional models; NSP strengthens cross-sentence reasoning by judging whether a sentence pair is coherent.

1. MLM (Masked Language Modeling)

In BERT pre-training, the model learns bidirectional context through the Masked Language Modeling (MLM) task: it randomly masks 15% of the tokens in the input text and predicts each masked token from the context on both its left and right.
(1) Task: Randomly mask 15% of the words in the input text and require the model to predict the masked words.
(2) Example: Given the input sentence "The cat sits on the [MASK]", the model needs to predict that "[MASK]" is "mat".

2. NSP (Next Sentence Prediction)

In the Next Sentence Prediction (NSP) task, BERT is fed sentence pairs in which the second sentence actually follows the first 50% of the time and is a random sentence the other 50% of the time. Training on this task teaches the model the logical relationship between sentences, improving performance on tasks such as question answering and text classification.

(1) Task: Determine whether two sentences are consecutive (50% are consecutive, 50% are random).
(2) Positive example: "I like cats" + "They are cute."

(3) Negative example: "I like cats" + "The sky is blue."

3. Learning suggestions for beginners

Phase 1: Introduction to Theory (2-3 days)

(1) Purpose: To understand the design motivation and core logic of MLM and NSP.
(2) BERT paper: Focus on Section 3 (the pre-training task design) of "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".

(3) Analogy: MLM is like playing a fill-in-the-blanks game, where you guess the masked word from its context (e.g. "I like [MASK]" → "cats"); NSP is about judging whether two sentences come from the same passage (e.g. "I like cats" + "They are cute." are continuous, while "I like cats" + "The sky is blue." are random).

Phase 2: Code Reproduction (5-7 days)

(1) Objective: Understand the implementation details of MLM and NSP through code.

(2) Code: There is no need to implement anything from scratch; you can directly load pre-trained models from the transformers library.

from transformers import BertTokenizer, BertForMaskedLM, BertForNextSentencePrediction
import torch

# Load the pre-trained tokenizer and the task-specific models
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")                # MLM head only
model_nsp = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")  # NSP head only

# Example input (MLM): predict the token at the [MASK] position
text = "The cat sits on the [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model_mlm(**inputs)
mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))  # Predicted word (e.g. "mat")

# Example input (NSP): pass the two sentences as a pair so the tokenizer inserts [SEP] itself
sentence1 = "I like cats."
sentence2 = "They are cute."    # Positive example (actually follows sentence1)
sentence3 = "The sky is blue."  # Negative example (random sentence)
inputs_nsp = tokenizer(sentence1, sentence2, return_tensors="pt")
inputs_nsp_neg = tokenizer(sentence1, sentence3, return_tensors="pt")
print(model_nsp(**inputs_nsp).logits.argmax().item())      # Expected 0 = "is the next sentence"
print(model_nsp(**inputs_nsp_neg).logits.argmax().item())  # Expected 1 = "is not the next sentence"
# Note: in actual BERT pre-training, MLM and NSP are trained jointly on the same batches.
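As the last comment notes, the two objectives are combined during actual pre-training. Here is a minimal sketch of that combined setup using the transformers library's BertForPreTraining class, which exposes both heads at once; it only shows how the two sets of logits are produced, not a full training loop.

from transformers import BertTokenizer, BertForPreTraining
import torch

# BertForPreTraining carries both heads used during pre-training:
# prediction_logits for MLM and seq_relationship_logits for NSP.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("I like cats.", "They are cute.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # (batch, seq_len, vocab_size) -> MLM head
print(outputs.seq_relationship_logits.shape)  # (batch, 2) -> NSP head (0 = is next, 1 = random)

In a real pre-training run the MLM and NSP losses are summed; the fine-tuning examples above only need one head at a time.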
Pre-training of GPT: Causal Language Modeling (CLM)

GPT's causal language modeling (CLM) excels at generating coherent text through one-way (left-to-right) autoregression, but it cannot use the following context, so it is better suited to "creation" tasks; BERT's masked language modeling (MLM), by contrast, is a bidirectional "cloze" exercise that is better at contextual understanding and suits "understanding" tasks.

1. CLM (Causal Language Modeling)

In GPT pre-training, the model uses causal language modeling (CLM) to predict the next word from the one-way context (previous text only), mathematically expressed as P(w_t | w_1, ..., w_{t-1}). Like word-by-word dictation or a typewriter, the model can only see what has already been written and generates the subsequent text step by step. The GPT series models (GPT-1/2/3/4) are all based on CLM, implemented through the Transformer's unidirectional (causal) attention mask; a minimal sketch of such a mask appears after the example below.
(1) Task: Predict the next word based on the previous text, similar to the process of humans reading text word by word.
(2) Example: Given the input “The cat sits on the”, the model needs to predict that the next word is “mat”.
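The sketch referenced above is a standalone illustration (my own, not taken from this article's code) of what a causal attention mask looks like in PyTorch: position t may only attend to positions up to t, which is exactly the constraint CLM relies on.

import torch

seq_len = 5
# Lower-triangular matrix: row t has ones only in columns 0..t,
# so token t can attend to itself and earlier tokens but never to later ones.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])

# In attention, the zero entries are replaced by -inf before the softmax,
# so future positions receive zero attention weight.
scores = torch.randn(seq_len, seq_len)  # dummy attention scores
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)  # each row sums to 1 over past positions only

This masking is how the "previous text only" condition in P(w_t | w_1, ..., w_{t-1}) is enforced inside the Transformer.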
2. Learning suggestions for beginners

Phase 1: Introduction to Theory (2-3 days)

(1) Purpose: To understand the core concepts and mathematical principles of CLM and its relationship with GPT.
(2) GPT papers: Focus on the sections describing the model architecture and pre-training objective in "Improving Language Understanding by Generative Pre-Training" (the original GPT-1 paper) and "Language Models are Unsupervised Multitask Learners" (the GPT-2 paper) to understand how CLM extends to generation tasks.

(3) Analogy: CLM is like writing an essay: you can only decide the next word based on what you have already written (e.g. "The cat sits on the [?]") and cannot go back to revise or peek at the following text.

Phase 2: Code Reproduction (5-7 days)

(1) Objective: Understand the implementation details of CLM through code, including the Transformer's unidirectional attention mask.

(2) Code: There is no need to implement anything from scratch; you can directly load the pre-trained model from the transformers library.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Input text (CLM task)
input_text = "The cat sits on the"
inputs = tokenizer(input_text, return_tensors="pt")

# Autoregressively generate the continuation, one token at a time
outputs = model.generate(**inputs, max_length=20, num_return_sequences=1,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))  # Full sentence (e.g. "The cat sits on the mat and sleeps.")
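The snippet above only runs inference. To see the CLM training objective itself, here is a minimal sketch (my own illustration, under the assumption that you use the transformers API): passing labels=input_ids to GPT2LMHeadModel makes the model compute the shifted next-token cross-entropy loss that pre-training minimizes.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# A tiny "pre-training" step on one sentence: the labels are the inputs themselves,
# and the model shifts them internally so each position predicts the NEXT token,
# i.e. it minimizes -log P(w_t | w_1, ..., w_{t-1}) averaged over positions.
inputs = tokenizer("The cat sits on the mat.", return_tensors="pt")
outputs = model(**inputs, labels=inputs.input_ids)
print(outputs.loss)  # Cross-entropy loss of the causal language modeling objective

# One optimizer step would look like this (sketch only; real pre-training uses
# huge corpora, large batches, and learning-rate schedules):
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()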