What is knowledge distillation?

Written by Silas Grey
Updated on: July 2, 2025
Recommendation

Gain a deep understanding of knowledge distillation, the core technique behind model compression and knowledge transfer.

Core content:
1. The origin and basic concepts of knowledge distillation
2. The difference between soft labels and hard labels and their impact on model training
3. The concrete implementation steps and key parameters of knowledge distillation
4. The advantages of knowledge distillation in practical applications
5. Common variants and extensions of knowledge distillation


Knowledge distillation (KD) is a method of model compression and knowledge transfer that aims to transfer the knowledge of a complex model (usually called the "teacher model") to a small model (usually called the "student model"). The core idea is that, by imitating the teacher model's outputs or intermediate features, the student model can maintain high performance while significantly reducing its parameter count and computational cost.

Knowledge distillation was first proposed by Hinton et al. in 2015. It is used mainly in deep learning and has become an important tool for model compression, acceleration, and transfer learning.

1. Basic principles of knowledge distillation

The core of knowledge distillation is to guide the training of the student model with the teacher model's "soft labels". Unlike traditional "hard labels" (i.e., the ground-truth class labels), soft labels are the probability distributions output by the teacher model, and they carry information about the relative relationships between classes.

Soft labels vs. hard labels

Hard labels: in an image classification task, the label might be [0, 0, 1, 0], indicating that the sample belongs to the third class.

Soft labels: the probability distribution output by the teacher model might be [0.1, 0.2, 0.6, 0.1], indicating the model's confidence in each class.

Soft labels contain more information, such as the similarity between classes (in the example above, class 2 is closer to the true class 3 than classes 1 and 4 are), which helps the student model learn better.
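
As a small numerical illustration (with a hypothetical 4-class task and made-up teacher logits), the hard label is a one-hot vector while the soft label is simply the softmax of the teacher's logits:

```python
import numpy as np

# Hypothetical 4-class example where the true class is the third one.
hard_label = np.array([0, 0, 1, 0])               # one-hot "hard" label

teacher_logits = np.array([0.5, 1.2, 2.3, 0.4])   # made-up teacher logits
soft_label = np.exp(teacher_logits) / np.exp(teacher_logits).sum()

print(hard_label)            # [0 0 1 0]
print(soft_label.round(2))   # ~[0.1 0.2 0.61 0.09] -- class 2 looks closer to the true class
```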

2. How knowledge distillation is implemented

Knowledge distillation is usually implemented in the following steps:

(1) Training the teacher model

The teacher model is typically a complex, high-performance model (such as a deep neural network).

The teacher model is trained on the training set until it reaches high performance.

(2) Generating soft labels

Use the teacher model to perform inference on the training data and generate soft labels (probability distributions).
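
A minimal sketch of this step in PyTorch (the model, data loader, and temperature value T here are illustrative assumptions; T is explained in step (4) below):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()                       # the frozen teacher needs no gradients
def generate_soft_labels(teacher, data_loader, T=4.0):
    teacher.eval()
    all_soft = []
    for inputs, _ in data_loader:
        logits = teacher(inputs)                         # raw teacher logits
        all_soft.append(F.softmax(logits / T, dim=1))    # temperature-softened probabilities
    return torch.cat(all_soft)
```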

(3) Training the student model

The goal of the student model is to fit both the hard labels and the teacher's soft labels; this is the classic teacher-student framework of knowledge distillation.

The loss function usually consists of two parts:

Traditional loss (such as cross-entropy): the difference between the student model's output and the hard labels.

Distillation loss: the difference between the student model's output and the teacher model's soft labels.

By adjusting the relative weights of these two parts, you can control how much the student model relies on the soft labels.
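
A sketch of how these two losses are commonly combined in PyTorch (the weight alpha and temperature T are illustrative choices; T is explained in step (4) below, and the T² factor is the standard scaling from Hinton et al. that keeps gradient magnitudes comparable):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    # Traditional loss: cross-entropy between student output and hard labels.
    ce_loss = F.cross_entropy(student_logits, hard_labels)

    # Distillation loss: KL divergence between the temperature-softened
    # student and teacher distributions, scaled by T^2 (as in Hinton et al.).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)

    # alpha weights the hard-label loss; (1 - alpha) weights the soft-label loss.
    return alpha * ce_loss + (1 - alpha) * kd_loss
```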

(4) Temperature

During the distillation process, a temperature parameter T is usually introduced to adjust the smoothness of the soft labels.

The role of the temperature parameter is to soften the probability distribution, making it easier for the student model to learn the knowledge of the teacher model.

The softened probabilities are computed as

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the output logits of the teacher model and T is the temperature parameter. T = 1 recovers the standard softmax, and larger values of T produce smoother distributions.
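
To see the softening effect, one can compare the same (made-up) logits at different temperatures:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])     # hypothetical teacher logits

print(F.softmax(logits, dim=0))            # T=1: ~[0.66, 0.24, 0.10] -- sharp
print(F.softmax(logits / 4.0, dim=0))      # T=4: ~[0.42, 0.32, 0.26] -- smoother
```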

3. Advantages of knowledge distillation

Model compression:

The student model is usually much smaller than the teacher model, with significantly fewer parameters and less computation.

Suitable for deployment on resource-constrained devices (such as mobile and embedded devices).

Faster inference:

The student model runs inference faster, making it suitable for real-time applications.

Knowledge transfer:

The student model can learn richer knowledge from the teacher model, including inter-class relationships and generalization ability.

Improved small-model performance:

Through distillation, a small model can approach the performance of a large model, and in some cases even exceed that of the same small model trained directly.

4. Variants of knowledge distillation

There are many variants and extensions of knowledge distillation; here are some common ones:

(1) Feature Distillation

The student imitates not only the output of the teacher model but also the feature representations of its intermediate layers.

By minimizing the feature differences between the intermediate layers of the student model and the teacher model, the student model can learn richer representations.
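
A minimal sketch of such a feature-matching loss in PyTorch; the 1x1 projection used to align channel dimensions and the choice of mean-squared error are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    """MSE between an intermediate student feature map and the teacher's."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution to project student features to the teacher's channel count.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Stop gradients into the teacher; match the projected student features to it.
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())
```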

(2) Self-Distillation

The teacher model and the student model are different parts of the same model.

For example, the output of the network's deeper layers can guide the training of its shallower layers.
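
A sketch under the assumption that the "teacher" is the network's own final (deep) classifier and the "student" is an auxiliary classifier attached to a shallower layer; the names and temperature are illustrative:

```python
import torch.nn.functional as F

def self_distillation_loss(deep_logits, shallow_logits, T=3.0):
    # The network's own deep (final) classifier acts as the teacher for an
    # auxiliary classifier attached to one of its shallower layers.
    return F.kl_div(
        F.log_softmax(shallow_logits / T, dim=1),
        F.softmax(deep_logits.detach() / T, dim=1),   # no gradients into the "teacher" branch
        reduction="batchmean",
    ) * (T ** 2)
```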

(3) Multi-Teacher Distillation

Use multiple teacher models to guide the training of student models.

By integrating the knowledge of multiple teacher models, the performance of the student model is improved.
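
One simple way to integrate several teachers is to average their softened output distributions before computing the distillation loss. The sketch below assumes plain averaging; weighted ensembles are also common:

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_labels(teachers, inputs, T=4.0):
    with torch.no_grad():
        probs = [F.softmax(t(inputs) / T, dim=1) for t in teachers]
    return torch.stack(probs).mean(dim=0)   # simple average of the teachers' soft labels
```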

(4) Online Distillation

The teacher model and the student model are trained simultaneously, rather than training the teacher model first and then the student model.

This approach can reduce training time.
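
A sketch of one training step in this setting, in the style of deep mutual learning, where two peer networks teach each other while both are still training; the single shared optimizer and the unweighted sum of losses are illustrative assumptions:

```python
import torch.nn.functional as F

def mutual_learning_step(model_a, model_b, optimizer, inputs, labels, T=2.0):
    logits_a, logits_b = model_a(inputs), model_b(inputs)

    # Each network fits the hard labels and also mimics its peer's softened output.
    def kd(student, teacher):
        return F.kl_div(
            F.log_softmax(student / T, dim=1),
            F.softmax(teacher.detach() / T, dim=1),
            reduction="batchmean",
        ) * (T ** 2)

    loss = (F.cross_entropy(logits_a, labels) + kd(logits_a, logits_b)
            + F.cross_entropy(logits_b, labels) + kd(logits_b, logits_a))

    optimizer.zero_grad()   # assumes one optimizer over both networks' parameters
    loss.backward()
    optimizer.step()
    return loss.item()
```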

5. Application scenarios of knowledge distillation

Mobile and embedded devices: compress large models into small models that fit on resource-constrained devices.

Real-time applications: accelerate inference to meet real-time requirements (such as autonomous driving and real-time translation).

Model deployment: in edge computing scenarios, small models reduce communication and computation overhead.

Transfer learning: transfer knowledge from a pre-trained model to a smaller model for a specific task.

6. Challenges of knowledge distillation

Quality of the teacher model: the performance of the teacher model directly affects the performance of the student model.

Capacity of the student model: the student model's capacity cannot be too small, or it will not be able to fully absorb the teacher's knowledge.

Training complexity: the distillation process requires additional computation (such as generating soft labels).

Task adaptability: the benefit of distillation in some tasks (such as generation tasks) may be less pronounced than in classification tasks.

Knowledge distillation is a powerful method for model compression and knowledge transfer. By transferring the knowledge of complex models to small models, it significantly reduces model size and computational cost while maintaining high performance, making it broadly useful for mobile deployment, real-time applications, and edge computing. As deep learning develops, new variants and extensions of knowledge distillation continue to emerge, further improving its applicability and effectiveness.