The evolution and implementation principles of large model distillation technology

Written by
Clara Bennett
Updated on: June 28, 2025
Recommendation

A key technique for deep learning model compression: exploring the secrets of knowledge distillation.

Core content:
1. The history and importance of knowledge distillation technology
2. Analysis of the working principle of the teacher-student model
3. Different forms and implementation principles of knowledge distillation


"Knowledge distillation is a way to make models lighter and smaller, and its effect goes far beyond what we might imagine."



In the field of deep learning, model compression and deployment is an important research topic because large models carry enormous cost and computing requirements; making models smaller has therefore become an urgent problem.

One technology applied to this miniaturization process is knowledge distillation, and what we usually call large model distillation is simply this technique applied to large models.

Of course, knowledge distillation is not a new technology. It was proposed by Nobel laureate Geoffrey Hinton and his colleagues in 2015. After ChatGPT made large models popular, knowledge distillation once again entered the public eye.

But the real surge in attention came with the release of DeepSeek. DeepSeek is widely regarded as a standout among Chinese models, and the main problem it tackles is the training cost of large models; what fewer people realize is that DeepSeek also relies heavily on knowledge distillation, releasing distilled versions of its R1 model built on Alibaba's Qwen (Tongyi Qianwen) series.



So what exactly is distillation, and what are its history and implementation principles?




Distillation Technology




Knowledge distillation was proposed by Nobel laureate Geoffrey Hinton in 2015, but strictly speaking, Hinton refined and popularized an idea that built on earlier work by his predecessors.


Distillation is currently defined as refining the "knowledge" that a teacher model (a large model) has learned from its training data, such as the relationships between categories and the distribution of features, into a student model (a small model).


Simply put, distillation is a teacher teaching a student. Before distillation, a model had to be trained from scratch, with its parameters randomly initialized; this is like having to learn everything by yourself from childhood, with no one to teach you.




Obviously, learning this way is inefficient; hence a profession emerged: teachers. Their role is to pass on the knowledge and experience they have accumulated, so that you learn faster, more efficiently, and more accurately.



Model distillation is built on the same idea: a trained large model is used to "teach" a simpler small model. Because it stands on the shoulders of giants, the distilled small model responds far faster and more cheaply than the large model while retaining much of its accuracy, and it clearly outperforms a small model of the same size trained from scratch.


Of course, as a popular technology, the implementation of distillation is not as simple as people imagine. Before Hinton formalized knowledge distillation in 2015, model distillation had already been used in some form. However, the distillation of that era was relatively crude: the student only imitated the teacher's final predicted labels at the output layer. These labels are called hard targets.


This is like a teacher at school saying: if you really can't understand it, just memorize it, you don't need to know why. But this creates a problem: you can only answer the same question or a very similar one, and you may be unable to solve anything new.


Therefore, teachers often say that we should not only learn knowledge but, more importantly, learn how to learn. The knowledge distillation Hinton proposed works similarly: the student learns the probability distribution the large model predicts over the data, something closer to its "thought process", rather than just memorizing the final answer. These distributions are called soft targets.
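To make the contrast concrete, here is a minimal PyTorch sketch; the three class names and the teacher's logit values are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image, over three classes: cat, dog, car.
teacher_logits = torch.tensor([4.0, 2.5, -1.0])

# Hard target: keep only the winning class ("just remember the answer").
hard_target = F.one_hot(teacher_logits.argmax(), num_classes=3).float()
print(hard_target)    # tensor([1., 0., 0.])

# Soft target: the teacher's full probability distribution. It also records
# that "dog" is far closer to "cat" than "car" is, knowledge that the hard
# target simply throws away.
soft_target = F.softmax(teacher_logits, dim=-1)
print(soft_target)    # roughly tensor([0.81, 0.18, 0.01])
```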


Today, distillation takes several forms, such as output-layer (logit) distillation, intermediate-layer (feature) distillation, and self-distillation (see the feature-distillation sketch below). But whatever the form, the goal is the same: to let the student model absorb the "knowledge" of the teacher model.
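As a concrete illustration of intermediate-layer distillation, here is a minimal PyTorch sketch; the batch size, feature dimensions, and the linear projection are assumptions made for this example, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hidden features captured from one teacher layer and the matching student
# layer for a batch of 8 inputs (dimensions are made up for this sketch).
teacher_feat = torch.randn(8, 768)                       # frozen teacher features
student_feat = torch.randn(8, 256, requires_grad=True)   # student features

# A learned projection lifts the student features into the teacher's feature
# space; an MSE loss then pulls them toward the teacher's representation.
proj = nn.Linear(256, 768)
feature_loss = F.mse_loss(proj(student_feat), teacher_feat.detach())
feature_loss.backward()
print(feature_loss.item())
```

In a real setup the features would be captured with forward hooks on the two networks, and this loss would be added to the student's ordinary training loss.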




Implementation principle


The implementation of knowledge distillation rests on two main ideas: knowledge transfer and soft labels.


Knowledge transfer: the "knowledge" the teacher model (large model) has learned from its training data (such as relationships between categories and feature distributions) is distilled into the student model (small model).


Soft labels: the probability distribution the teacher model outputs (as opposed to hard labels) carries more information, for example that "cats and dogs share similar features", and by imitating these soft labels the student model learns to generalize.


Model distillation uses a temperature T to control how soft the labels are. The higher the temperature, the smoother the teacher's probability distribution and the more it reveals about the relationships between classes; the lower the temperature, the sharper the distribution, until it approaches a hard label.
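Putting the two ideas together, a Hinton-style distillation loss mixes a soft-target term (KL divergence at temperature T) with the ordinary hard-label cross-entropy. The sketch below is a minimal PyTorch version; the values of T and alpha, and the toy logits, are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD loss: soft-target KL term plus hard-label cross-entropy.

    The T*T factor keeps the soft-target gradients on roughly the same scale
    as the hard-target term; T and alpha are tunable hyperparameters.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Effect of temperature on the teacher's soft labels (toy logits):
logits = torch.tensor([4.0, 2.5, -1.0])
print(F.softmax(logits / 1.0, dim=-1))   # sharp:  roughly [0.81, 0.18, 0.01]
print(F.softmax(logits / 4.0, dim=-1))   # softer: roughly [0.51, 0.35, 0.15]
```

In a training loop, teacher_logits would come from a frozen teacher forward pass under torch.no_grad(), and student_logits from the student being trained.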