Understanding pre-training, fine-tuning, and in-context learning in one article

Written by
Iris Vance
Updated on: July 8, 2025
Recommendation

Master the core skills of deep learning model training.

Core content:
1. The concepts of, and differences between, pre-training, fine-tuning, and in-context learning
2. Application examples of these three technologies in large model training
3. The role and impact of unsupervised learning and supervised learning in model training

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

With the development of deep learning, large models have become an important pillar of artificial intelligence. In training and applying large models, pre-training (PT), fine-tuning (FT), and in-context learning (ICL) are three key techniques. This article introduces the concepts, differences, and applications of these three techniques in large models.

Let's first look at a diagram that shows the relationship between the three. In the broad sense, training includes pre-training, fine-tuning, and in-context learning; in the narrow sense, training refers specifically to the pre-training stage. Well-known large-model companies such as Zhipu AI and Baichuan Intelligence provide models like GLM-130B and Baichuan-2-192K, all of which are pre-trained models. What is colloquially called "alchemy" (a Chinese slang term for model training) generally refers to pre-training.


Pre-training, fine-tuning, and in-context learning differ in their goals, methods, and application scenarios.
Pre-training aims to learn general feature representations and improve the model's generalization ability;
Fine-tuning adjusts parameters for a specific task to improve task adaptability;
In-context learning focuses on the relationships within the provided context, improving the model's grasp of the data's internal structure and associations.


Pre-training: Building a general feature extractor

Pre-training (PT) refers to training a model on a large amount of data so that it learns general feature representations, which gives it better generalization ability on downstream tasks. Pre-training usually adopts unsupervised (self-supervised) learning methods such as autoencoders and generative adversarial networks; for language models, the objective is typically predicting the next token. In large models, pre-training can significantly improve performance while reducing per-task training time and computing resources. During pre-training, the model is exposed to a large amount of unlabeled text data, such as books, articles, and websites. For example, a language model like GPT-3 is pre-trained on a dataset drawn from books, articles, and web pages. The goal of pre-training is to capture the underlying patterns, structures, and semantic knowledge present in the text corpus.
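As a toy illustration of the self-supervised idea behind pre-training (the "labels" are simply the next tokens of the unlabeled text itself), here is a minimal sketch. It is a bigram counter, not a real neural language model; the corpus and tokenization are illustrative:

```python
from collections import Counter, defaultdict

def pretrain_bigram(corpus):
    """'Pre-train' a toy bigram language model: count which token
    follows which in unlabeled text. No human annotation is needed,
    because the next word in the text serves as the training target."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(model, token):
    """Predict the most frequent next token seen during pre-training."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]

# Unlabeled "corpus": the structure of the text itself is the supervision.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
model = pretrain_bigram(corpus)
print(predict_next(model, "sat"))  # → on
```

Real pre-training replaces the counting table with a neural network trained by gradient descent on billions of tokens, but the objective is the same shape: predict the next token from unlabeled text.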

Unsupervised learning (UL) is a learning method in machine learning. In unsupervised learning, the model has no given labels or target outputs; instead, it learns the intrinsic structure and feature representations of the data from the input data alone. This approach aims to discover patterns, associations, or clusters in the data without manually labeled data. Unsupervised learning is widely used in tasks such as preprocessing, feature extraction, dimensionality reduction, and clustering. Common unsupervised learning algorithms include autoencoders, generative adversarial networks, and K-means clustering.
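Since the paragraph above names K-means as a representative unsupervised algorithm, here is a minimal sketch on 1-D data. The dataset and parameters are illustrative; note that no labels are provided anywhere:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means on 1-D points: alternate between assigning
    each point to its nearest centroid and moving each centroid
    to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean
        # (keep the old centroid if a cluster ends up empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups in unlabeled data; the algorithm finds them itself.
data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans(data, k=2))  # ≈ [1.0, 10.0]
```

The algorithm discovers the two groups purely from the structure of the inputs, which is exactly what "no target output" means in the paragraph above.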

Fine-tuning: a task-specific optimization tool

Fine-tuning (FT) further trains a pre-trained model on a specific task. Through fine-tuning, the model can achieve better performance on that task. Fine-tuning usually uses supervised learning, adjusting model parameters with labeled data. In large models, fine-tuning makes full use of the pre-trained model's general feature representations, achieving task adaptation far more efficiently than training from scratch.

Supervised learning is a machine learning method in which the model learns from training data with known labels. In supervised learning, we have an input dataset and a corresponding set of labels or target outputs. The model learns the relationship between inputs and outputs so that it can predict the labels of new data. The method is called "supervised" because the labels in the training data supervise the model during learning; it relies on labeled training data, which is usually classified or annotated manually.

Supervised learning is used in many tasks, such as classification, regression, speech recognition, and image recognition. Common supervised learning algorithms include support vector machines, decision trees, and logistic regression. The goal is to train a model that can predict the corresponding label or output for new, unlabeled data.

SFT (Supervised Fine-Tuning) is a common supervised learning method. Starting from a pre-trained model, it uses labeled data to fine-tune the model for a specific task. Through fine-tuning, the model can apply previously learned knowledge and adapt quickly to new tasks, improving performance and results. This method is widely used in tasks such as classification and regression, and has achieved remarkable success.
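The core idea of SFT, stripped of all the neural-network machinery, can be sketched in a few lines: initialize from "pre-trained" weights rather than from scratch, then take supervised gradient steps on a small labeled set. The tiny logistic model, the weights, and the data below are illustrative, not a real LLM pipeline:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(weights, labeled_data, lr=0.5, epochs=50):
    """Supervised fine-tuning sketch: start from 'pre-trained'
    weights and nudge them with gradient steps on labeled
    (input, label) pairs, instead of training from random init."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in labeled_data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            grad = p - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical "pre-trained" weights and a small labeled set
# (each input is [bias_term, feature]; labels are 0 or 1).
pretrained = [0.1, 0.1]
labeled = [([1.0, 0.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
w = fine_tune(pretrained, labeled)
print(round(sigmoid(w[0] + 2.0 * w[1]), 3))  # confident on the class-1 side
```

In real SFT the "weights" are billions of transformer parameters and the labeled pairs are prompt–response examples, but the update rule is the same supervised gradient descent.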

RLHF (Reinforcement Learning from Human Feedback) is another fine-tuning technique. Unlike SFT, it is a reinforcement learning method that guides a model's behavior through human feedback. In RLHF, humans provide feedback on the model's outputs, such as which responses are good and which are bad, and the model gradually improves its policy based on this feedback. This alleviates the extensive trial and error required by traditional reinforcement learning, allowing the model to learn tasks more efficiently. RLHF is particularly suitable for complex or subjective tasks, such as language generation, where it is difficult to define a clear loss function. Through human feedback, RLHF can produce outputs that better match human intentions and preferences.
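The feedback loop at the heart of RLHF can be illustrated with a toy policy-gradient update (this is a sketch of the idea only, not the full reward-model-plus-PPO pipeline used in practice; the two-response setup and learning rate are illustrative):

```python
import math

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def update_from_feedback(logits, chosen, reward, lr=1.0):
    """Toy RLHF-style update: raise the log-probability of the
    response a human rewarded (reward > 0) and lower it for a
    response they penalized (reward < 0), in proportion to how
    surprising the choice was under the current policy."""
    probs = softmax(logits)
    return [l + lr * reward * ((1.0 if i == chosen else 0.0) - probs[i])
            for i, l in enumerate(logits)]

# Two candidate responses; a human repeatedly prefers response 1.
logits = [0.0, 0.0]
for _ in range(10):
    logits = update_from_feedback(logits, chosen=1, reward=+1.0)
print(softmax(logits)[1])  # probability of the preferred response rises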

In-context learning: capturing intrinsic relationships in data

The pre-trained GPT-3 model has a remarkable ability, later named in-context learning (ICL), also translated as contextual or situational learning.

Simply put, this capability means that the pre-trained GPT-3 model does not need to be retrained when moving to a new task. Instead, it is enough to provide a task description (which is optional), then a few examples (task queries paired with their answers), and finally the query the model should answer. All of this is packaged together as the model's input, and the model outputs the correct answer to the final query.

For example, let's say we want to use GPT-3 to do a translation task, translating English into French. The input format is as follows:
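The example itself did not survive the page conversion; a representative prompt in the few-shot format popularized by the GPT-3 paper looks like this (the word pairs are illustrative):

```
Translate English to French:        <- task description (optional)
sea otter => loutre de mer          <- example 1
peppermint => menthe poivrée        <- example 2
cheese =>                           <- query the model should complete
```

The model, given this entire string as input, continues it with the French translation of the final word, with no parameter updates at all.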

In-context learning is very flexible: besides the translation task shown above, it can also be used for grammar correction and even code writing. Remarkably, GPT-3's training process never explicitly provided training data shaped like the task descriptions and examples seen at test time. Of course, GPT-3's training data is vast (it includes wikis, books, journals, Reddit discussions, and more), and it may already contain similarly structured data for various tasks.