DeepSeek is blowing up, but 90% of people don't know what the distillation technology it mentions is

The DeepSeek R1 model has sparked heated discussions in the AI community, but the distillation technology behind it is little known.
Core content:
1. The release of the DeepSeek R1 model caused a sensation, and its performance approaches industry benchmarks
2. The origin of the concept of distillation technology, proposed by Hinton and others
3. The principle of distillation technology: transfer knowledge from large models to small models to improve performance and efficiency
Last week, DeepSeek released the R1 model, and it landed like a bombshell in the AI community. As a domestically developed model, it performed well across a range of tests, with many metrics approaching or even exceeding industry benchmarks such as OpenAI's o1 series. As soon as the news broke, AI enthusiasts were buzzing about R1 on every platform, and researchers started digging into its technical report, trying to uncover the secrets behind its capabilities.
However, while everyone was marveling at R1's powerful performance, I, a liberal arts major, got stuck on the second line of the official introduction: "distillation technology". What is that? Well, the experts have gone off to benchmark R1, so I'd better go and do my homework.
What is distillation technology
The knowledge distillation technology in the field of AI is generally believed to have been proposed by Geoffrey Hinton, Oriol Vinyals and Jeff Dean in 2015.
In the paper "Distilling the Knowledge in a Neural Network", they laid out the concept and technique of knowledge distillation for the first time: by transferring the knowledge of a large, complex teacher model into a simpler student model, the student can keep high performance while being much smaller and faster at inference, which offers an effective way to tackle the deployment and efficiency problems of deep learning models.
Simply put, it works like school: teachers have rich knowledge and experience, and students improve by learning what the teachers teach. In AI, a large model that has been trained on massive amounts of data has accumulated a great deal of "knowledge": an understanding of data features, pattern recognition, and so on. A small model, with fewer parameters and a simpler structure, would struggle to reach that level of performance if trained directly. Through distillation, however, the small model can learn the large model's "way of thinking" and "experience", keeping solid performance while gaining faster inference and a lower compute cost. It is like a student who lacks the teacher's depth of knowledge but, by learning from the teacher, can still score well on the exam.
For example, in image recognition tasks, a large model can accurately identify many kinds of images. After the small model absorbs the large model's knowledge through distillation, it too can perform well at recognition. Moreover, on devices with limited computing resources such as mobile phones, the small model runs fast enough for real-time recognition, such as quickly identifying the objects and scenes in a photo.
How to use AI distillation technology
Preparation of teacher model and student model
The first step in applying AI distillation is to prepare the teacher model and the student model. Like a well-planned lesson, you first need an experienced, knowledgeable "teacher" and a "student" who is eager to learn and full of potential.
The teacher model needs to be trained for a long time on large amounts of data so that it can accurately recognize and analyze all kinds of complex data patterns. In image recognition, for instance, the teacher may have been trained on millions of images and can accurately identify different kinds of animals, landscapes, people, and so on, like a seasoned expert who has seen every situation. In practice, large convolutional neural networks such as ResNet-101 are often used as teacher models; pre-training on large-scale image datasets such as ImageNet gives them excellent feature-extraction and classification ability.
The student model, by contrast, has a simpler structure, fewer parameters, and much lower compute requirements. It is like a beginner: not as experienced or deeply knowledgeable as the teacher, but enthusiastic and full of potential. The student can be a slimmed-down version of the teacher, for example with fewer layers and neurons. Taking Transformer-based models as an example, if the teacher is a large model with many Transformer blocks and attention heads, the student might contain only a few layers and heads. The student's parameters also need to be initialized: random initialization is like letting the student start from a blank sheet of paper, while initializing from a pre-trained model is like giving the student a basic knowledge framework to build on.
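To make this concrete, here is a minimal sketch in PyTorch of what preparing the two models can look like. The choice of ResNet-101 as teacher and ResNet-18 as student is only an illustration of the "large teacher, small student" setup described above, not a recipe taken from any specific system, and it assumes torchvision 0.13 or newer.

```python
from torchvision import models

# Teacher: a large network pre-trained on ImageNet (~44.5M parameters).
teacher = models.resnet101(weights="IMAGENET1K_V1")
teacher.eval()                      # the teacher is frozen during distillation
for p in teacher.parameters():
    p.requires_grad = False

# Student: a much smaller network (~11.7M parameters). It can start from
# random initialization ("a blank sheet of paper") ...
student = models.resnet18(weights=None)
# ... or from a small pre-trained model ("a basic knowledge framework"):
# student = models.resnet18(weights="IMAGENET1K_V1")

print(sum(p.numel() for p in teacher.parameters()))  # teacher size
print(sum(p.numel() for p in student.parameters()))  # student size
```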
The process of knowledge transfer
When both the teacher model and the student model are ready, the critical knowledge-transfer stage begins. In this stage, the student model not only learns the original data labels, the so-called "hard targets", but also strives to imitate the output of the teacher model, the so-called "soft targets": a probability distribution over classes that carries richer information than a hard label.
For example, in an image classification task, suppose a picture's true label is "cat"; that is the hard target. After analyzing the picture, the teacher model might output a probability of 0.8 for "cat", 0.1 for "dog", and 0.1 for "other animals". This probability distribution is the soft target: it tells the student that the picture is most likely a cat but also bears some resemblance to other animals, which gives the student more to learn from.
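Written out in code (using the same made-up class names and probabilities as the example above), the two kinds of targets look like this:

```python
# Hard target: a single ground-truth label for the picture.
hard_target = "cat"

# Soft target: the teacher's full probability distribution over classes.
soft_target = {"cat": 0.8, "dog": 0.1, "other animals": 0.1}
```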
In actual training, to help the student model better imitate the teacher's output, a temperature parameter (usually written T) is introduced. It acts like a knob controlling how smooth the soft-target distribution is. When T is large, the distribution becomes smoother and the probability gaps between classes shrink, so the student learns more about the teacher's overall judgment across all classes rather than focusing only on the most likely one. When T is small, the distribution becomes sharper, the top class stands out more, and the student concentrates on the teacher's judgment about the most likely class. For example, in a 10-class task, the teacher's soft targets at T = 1 and T = 10 might look like the two distributions below, which show clearly how temperature reshapes the output:
T = 1:  [0.9, 0.05, 0.02, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
T = 10: [0.5, 0.15, 0.1, 0.08, 0.07, 0.03, 0.03, 0.02, 0.01, 0.01]
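The smoothing effect of T is easy to reproduce with a temperature-scaled softmax. The logits below are invented for illustration, so the exact values will not match the distributions above, but the pattern is the same: sharply peaked at T = 1, much flatter at T = 10.

```python
import torch
import torch.nn.functional as F

# Made-up teacher logits for a 10-class task.
logits = torch.tensor([6.0, 3.0, 2.0, 1.5, 1.3, 0.5, 0.4, 0.2, 0.0, -0.2])

for T in (1.0, 10.0):
    # Dividing the logits by T before the softmax smooths the distribution.
    soft_targets = F.softmax(logits / T, dim=0)
    print(f"T = {T}:", [round(p, 3) for p in soft_targets.tolist()])
```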
To measure how far the student is from the teacher, a loss function is defined, and it generally has two parts. The first part measures the difference between the student's and the teacher's output distributions, usually with the KL divergence (Kullback-Leibler divergence), which quantifies how similar two probability distributions are. The second part measures the difference between the student's prediction and the true label; in classification tasks this is usually the cross-entropy loss. Training repeatedly adjusts the student's parameters to minimize this combined loss, so the student gradually absorbs the teacher's knowledge and experience. In practice, the weight given to each part is tuned for the task at hand: if you care more about accuracy on the true labels, increase the weight of the cross-entropy term; if you want the student to imitate the teacher more closely, increase the weight of the KL divergence term.
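Putting the two terms together, a distillation loss and training step can look roughly like the sketch below. The function name, the values of T and alpha, and the commented training loop are illustrative assumptions, not the exact recipe used by DeepSeek.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    # The T*T factor rebalances gradient magnitudes, as suggested in
    # "Distilling the Knowledge in a Neural Network".
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # alpha trades off imitating the teacher vs. fitting the true labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# One training step (teacher frozen, student being updated):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# student_logits = student(images)
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```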
The role of distillation technology
In terms of model deployment, AI models end up in very different environments, and many devices have very limited computing resources and memory, like a small house with little storage space. Mobile and IoT devices such as phones and smart watches lag far behind large servers in both chip compute and memory. Running an undistilled large model on such devices makes AI applications slow, laggy, or simply unable to run at all.
Through knowledge distillation, small models can run smoothly on these resource-constrained devices while maintaining certain performance, and at the same time, they will not cause problems such as device overheating and excessive power consumption due to excessive computing requirements. In the field of autonomous driving, edge computing devices need to process a large amount of sensor data in real time. If large models are used for calculation, it is difficult to meet real-time requirements. The distilled small model can make decisions quickly with limited hardware resources to ensure driving safety.
In terms of inference speed, small models produce results quickly thanks to their few parameters and simple structure. In scenarios with strict real-time requirements, this fast inference is especially valuable, giving users a smooth, low-latency experience.
In terms of energy consumption, the operation of a small model is like an energy-saving light bulb, which consumes very little electricity; while a large model is like a high-power electric heater, which consumes a lot of energy. In data centers, a large number of servers run various AI models. If large models are used, the electricity cost will be a huge expense. Using small models through distillation technology can not only reduce energy consumption, but also reduce the demand for cooling equipment, further reducing operating costs. In some battery-powered devices, such as drones and mobile robots, reducing energy consumption means that the working time of the equipment can be extended and the efficiency of the equipment can be improved.