A plain-language guide to distillation and quantization of large models

Written by Jasper Cole
Updated on: June 13, 2025

Recommendation

In-depth analysis of large model optimization technology, exploring the secrets of model distillation and quantization.

Core content:
1. Model distillation: inherit knowledge and simplify models
2. Model quantization: improve efficiency and reduce precision
3. Advantages and applications of distillation and quantization technology

Yang Fangxian, Founder of 53A and Tencent Cloud Most Valuable Expert (TVP)

1. Model Distillation: Knowledge Inheritance and Simplification
  1.1 Why do we need model distillation?
  1.2 How is model distillation achieved?
  1.3 Advantages of Model Distillation
2. Model Quantization: Reduce Precision and Improve Efficiency
  2.1 Why do we need to quantize models?
  2.2 What are the methods for model quantization?
  2.3 Advantages of Model Quantization
3. Summary

As large models boom, their size and complexity keep growing, which brings problems such as high computing costs and heavy storage requirements. To let large models run efficiently on more devices, model distillation and model quantization have emerged. These two techniques essentially "slim down" and "tune up" large models. Today, let's take a deeper look at them.

1. Model Distillation: Knowledge Inheritance and Simplification

Model distillation is a knowledge transfer technique. Simply put, it transfers the knowledge of a large, complex, high-performing "teacher model" to a small, efficient "student model", much like an experienced teacher guiding a student so that the student quickly grasps the essentials.

1.1 Why do we need model distillation?

Although large models are highly capable, their sheer size demands powerful hardware, which makes them hard to use in resource-constrained scenarios such as mobile phones and embedded devices. Directly training a small model often produces unsatisfactory results, because its limited capacity cannot capture enough of the knowledge details. Model distillation offers a way out of this dilemma.

1.2 How is model distillation achieved?

  • Training the teacher model: First, we need to carefully train a powerful teacher model. This model usually has a complex structure and a large number of parameters, just like a knowledgeable and experienced scholar who can accurately grasp all kinds of knowledge.

  • Generating soft labels: The teacher model runs predictions on the training data, and its output is not just a single "correct answer" (a hard label) but a probability distribution over the categories, known as a soft label. For example, in image recognition, rather than merely judging a picture to be a "cat", the teacher model can report richer information such as 80% cat, 15% leopard, and 5% dog. These soft labels encode the teacher model's confidence in each category and are a reflection of its knowledge.

  • Training the student model: The student model is trained using the soft labels generated by the teacher model. During training, two losses are usually combined: the loss between the student model's own predictions and the true labels (the student loss), and the difference between the student's output and the teacher's soft labels (the distillation loss). By optimizing this combined loss function, the student model continuously adjusts its parameters so that its output approaches the teacher's as closely as possible, much like a student gradually mastering knowledge by imitating the teacher's problem-solving ideas and ways of thinking. A minimal sketch of the combined loss follows this list.

  • Fine-tuning: After distillation, the student model is further fine-tuned to improve its performance.
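To make the combined loss concrete, here is a minimal sketch of the classic temperature-based distillation loss in PyTorch. The function name, the temperature T, and the weighting factor alpha are illustrative assumptions, not values prescribed by this article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of distillation loss (soft labels) and student loss (hard labels)."""
    # Distillation loss: KL divergence between temperature-softened distributions.
    # Scaling by T*T keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Student loss: ordinary cross-entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example usage with random tensors standing in for real model outputs:
student_logits = torch.randn(8, 10)   # batch of 8, 10 classes
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In practice, a higher temperature softens the teacher's distribution and exposes more of its knowledge about similar classes, while alpha trades off imitating the teacher against fitting the true labels.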

1.3 Advantages of Model Distillation

  • Model compression: The student model is much smaller than the teacher model, which greatly reduces the number of parameters and the computational cost, making it suitable for deployment on resource-limited devices such as mobile phones and IoT devices and enabling intelligent applications on them.

  • Close performance: By learning the teacher model's knowledge, the student model can approach, and in some cases even exceed, the teacher's performance. For example, DistilBERT, a distilled version of BERT, is about 40% smaller than BERT yet retains roughly 97% of its language-understanding performance, and it performs well across natural language processing tasks.

  • Strong generalization ability: Soft labels carry richer information, so the student model handles new data better and can generalize beyond the exact training examples.

The application scenarios of model distillation are very broad. In the field of natural language processing, the emergence of lightweight models such as DistilBERT and TinyBERT allows mobile phones and other devices to run NLP tasks smoothly; in computer vision, large convolutional neural networks can be distilled into lightweight models and applied to scenarios such as mobile phone photography and face recognition; in the field of edge computing, low-power devices in scenarios such as smart homes and autonomous driving can also realize AI functions thanks to model distillation technology.

2. Model Quantization: Reduce Precision and Improve Efficiency

Model quantization compresses a model by lowering the numerical precision used to represent its parameters and computations. Its core idea is to convert the model's floating-point parameters into low-precision integers (such as 8-bit or lower) to reduce storage requirements and computational cost.
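As a rough illustration of this mapping (not any particular framework's implementation), the sketch below quantizes an FP32 weight array to INT8 with a scale and zero point, then dequantizes it to show the rounding error; the asymmetric per-tensor scheme and the variable names are assumptions made for the example.

```python
import numpy as np

def quantize_int8(w):
    """Asymmetric affine quantization of an FP32 array to INT8 (illustrative)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = max((w_max - w_min) / 255.0, 1e-8)   # spread the FP32 range over 256 integer levels
    zero_point = round(-w_min / scale) - 128     # integer that represents the real value 0.0
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map INT8 values back to approximate FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize_int8(q, scale, zp)
print("max rounding error:", np.abs(w - w_hat).max())
```

Each FP32 value (4 bytes) is replaced by a single INT8 value (1 byte) plus a shared scale and zero point per tensor, which is where the roughly 4x storage saving comes from.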

2.1 Why do we need to quantize models?

Deep learning models involve an enormous number of multiply-accumulate operations, and real-time inference usually requires a powerful computing platform such as a GPU, which is too costly and power-hungry for end products such as mobile phones and smart watches. In addition, the huge number of parameters in large models places heavy demands on memory bandwidth and compute. Compared with floating-point models, quantized fixed-point models occupy less memory and benefit from higher fixed-point throughput on most hardware. Model quantization has therefore become an important means of reducing computation and improving inference efficiency.

2.2 What are the methods for model quantization?

  • Post-training quantization (PTQ): quantization is performed after the model has finished training. Quantization parameters (such as scaling factors) are determined through statistical analysis, and weights and activation values are quantized. It is further divided into dynamic offline quantization and static offline quantization.

    • Dynamic offline quantization: only the weights of specific operators are converted offline from FP32 to INT8/INT16; biases and activations are quantized dynamically during inference, with scaling factors computed on the fly from the actual input values.

    • Static offline quantization: a small amount of unlabeled calibration data is run through the model, and methods such as KL divergence are used to compute the scaling factors. Unlike dynamic quantization, a statically quantized model goes through a "calibration" step before deployment to fix these scaling factors. Post-training quantization requires no change to the model architecture and no retraining; it is simple and efficient, but may introduce some quantization loss.

  • Quantization-aware training (QAT): quantization noise is injected during training so that the model adapts to low-precision representations while it is being trained, improving accuracy after quantization. The method simulates the effect of quantization in the forward pass and incorporates the quantization error into the training objective to preserve model performance. Training becomes slower, but the resulting accuracy is higher. In most cases PTQ, which requires no training, is preferred; QAT is considered when PTQ cannot meet the accuracy requirements. A minimal fake-quantization sketch follows this list.

  • Mixed precision training: data types of different precisions are combined during training to balance accuracy and computational efficiency. For example, in face recognition the key eye region can be computed in higher-precision FP16 while the background is computed in lower-precision INT8; in speech recognition, keywords can be computed with 16 bits and silent segments with 4 bits. This maximizes computational efficiency while preserving overall performance.
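As a hedged illustration of the QAT idea, the sketch below simulates INT8 quantization in the forward pass while letting gradients pass through the rounding step unchanged (a straight-through estimator). The class and function names, the symmetric per-tensor scale, and the INT8 range are assumptions for this example rather than any specific framework's API.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-then-dequantize so the forward pass sees INT8 rounding error."""
    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -128, 127)  # simulated INT8 values
        return q * scale                                     # back to float for the rest of the net
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through: gradients ignore the rounding

def fake_quantize(x):
    scale = x.abs().max() / 127.0 + 1e-8   # simple symmetric per-tensor scale
    return FakeQuant.apply(x, scale)

# During QAT, weights (and often activations) are passed through fake_quantize in
# the forward pass, so the network learns parameters that tolerate quantization noise.
w = torch.randn(16, 16, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()   # gradients still flow back to w despite the rounding
```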

2.3 Advantages of Model Quantization

  • Smaller model size: taking 8-bit quantization as an example, the model size can be reduced to roughly a quarter of its 32-bit floating-point counterpart, greatly lowering storage requirements and making model storage and updates more convenient. A small worked calculation follows this list.

  • Lower power consumption: Moving 8-bit data is 4 times more efficient than moving 32-bit floating-point data. Since memory usage is proportional to power consumption to a certain extent, quantization can effectively reduce device power consumption and extend battery life, which is especially important for mobile devices.

  • Faster computing speed: most processors handle 8-bit data faster, and the advantage is even more pronounced with binary quantization. On many computing platforms, INT8 compute throughput exceeds FP16. By quantizing weights and activation values, INT8 compute can be exploited to speed up model inference, especially for large batches and long contexts on the server side; quantizing weights and the KV cache also reduces memory access and improves compute efficiency.
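To make the size arithmetic concrete, here is a tiny worked calculation under an assumed parameter count; the 7-billion-parameter figure is purely illustrative and not taken from this article.

```python
params = 7_000_000_000                     # assumed model size, for illustration only
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: {gib:.1f} GiB")
# FP32 ≈ 26.1 GiB, FP16 ≈ 13.0 GiB, INT8 ≈ 6.5 GiB:
# the 8-bit model needs about a quarter of the FP32 storage (plus small per-tensor scales).
```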

Model quantization technology has broad application prospects in the field of deep learning, and is especially suitable for edge devices and real-time application scenarios. It can significantly reduce the storage and computing requirements of the model while maintaining high accuracy, allowing large models to run efficiently on more devices.

3. Summary

Model distillation and model quantization are important techniques for optimizing large models, addressing the problems large models face in practical applications from different angles. Model distillation transfers knowledge from a large model to a small one, achieving compression while retaining most of the large model's performance; model quantization reduces numerical precision to cut storage and computation costs and speed up inference.