Soft labels: the core mechanism and application of knowledge distillation

Written by
Clara Bennett
Updated on: July 9, 2025

In-depth analysis of knowledge distillation technology, exploring how soft labels can improve model generalization capabilities.

Core content:
1. The mechanism and importance of soft labels in knowledge distillation
2. The limitations of hard labels and solutions for soft labels
3. The application and effect of soft labels in actual model training

Introduction

In the previous article, we introduced the basic concepts and working principles of knowledge distillation and showed how it can transfer the capabilities of large models to small models. This article delves into the core mechanism of knowledge distillation, soft labels, and explains why they are the key to effective knowledge transfer.

Next, we will analyze the limitations of traditional hard labels, how soft labels can make up for these deficiencies, and how they can be applied in actual training processes.

Soft Labels: The Core Mechanism of Knowledge Distillation

Soft labels are a core concept in knowledge distillation: they refer to the complete probability distribution produced at the teacher model's output layer, rather than just the ground-truth answer. Soft labels represent a shift from simple "yes or no" judgments to rich probability distributions over "how similar" things are, enabling the student model to acquire the "dark knowledge" accumulated inside the teacher model. "Dark knowledge" here refers to the implicit knowledge the large model learns during training that is not directly reflected in the final classification result, such as the similarity relationships between categories and the structure of the feature space. Although this knowledge never appears in hard labels, it is crucial for how the model understands the data and generalizes.

In traditional machine learning, we usually use "hard labels" to train models. For example, the label of a cat picture may be [1,0,0,0], indicating that the picture belongs to the first category (cat). However, this simple representation method has obvious limitations:

  1. Limited information : Hard labels only provide the "final result" and say nothing about how confident a model is in each category.

  2. Loss of subtle judgments : A model's predicted probabilities for the different categories carry rich similarity information, all of which hard labels discard.

  3. Not conducive to knowledge transfer : If only hard labels are used during distillation, the teacher model cannot pass on to the student the nuanced decision criteria it has learned internally.

As shown in the figure below, when faced with a picture of a cat, the teacher model might judge: 60% probability that it is a cat, 20% that it is a small lynx, 15% that it is a tiger cub, and 5% that it is some other animal; it then concludes that the photo shows a cat.

[Figure: Image recognition process]

Practical application of soft labels

During the knowledge distillation process, the soft labels retain the complete probability distribution of the teacher model output layer, making the knowledge transfer more comprehensive:

Hard labels only tell us: "This is a cat" [1, 0, 0, 0]

The soft label conveys the complete information: "60% chance of being a cat, 20% chance of being a small lynx, 15% chance of being a tiger cub, 5% chance of being something else" [0.6, 0.2, 0.15, 0.05]

If we only use hard labels during the distillation process, the student model will not be able to obtain key insights from the teacher model's judgment process, such as "this picture of a cat has a certain similarity to a lynx." In this case, the student model can only learn the "correct answer" but cannot learn "why it is this answer" and "the relationship with other possible answers," resulting in incomplete knowledge transfer.
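To make the contrast concrete, the sketch below shows how a soft label can be obtained from a teacher's raw outputs. This is a minimal illustration in PyTorch, assuming a hypothetical four-class problem (cat, lynx, tiger cub, other) and made-up teacher logits; the temperature T is a standard distillation knob for softening the distribution, not something introduced in this article.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-class problem: [cat, lynx, tiger cub, other].
hard_label = torch.tensor([1.0, 0.0, 0.0, 0.0])   # one-hot "this is a cat"

# Made-up teacher logits for one cat image (illustrative values only).
teacher_logits = torch.tensor([3.2, 2.1, 1.8, 0.4])

# A plain softmax turns the logits into the soft label, i.e. the full
# probability distribution over all categories.
soft_label = F.softmax(teacher_logits, dim=-1)
# roughly tensor([0.61, 0.20, 0.15, 0.04]), close to the example above

# A temperature T > 1 (a standard distillation trick) softens the distribution
# further, exposing even more of the inter-class similarity structure.
T = 2.0
softened_label = F.softmax(teacher_logits / T, dim=-1)

print(hard_label, soft_label, softened_label)
```

Note how the plain softmax already reproduces roughly the [0.6, 0.2, 0.15, 0.05] distribution from the example, while a larger temperature flattens it further so that the similarity information is harder for the student to miss.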

The value of soft labels

Soft labels solve the information-loss problem in knowledge transfer by retaining the complete probability distribution, and they provide the following key benefits:

  1. Transferring dark knowledge : The probability distribution reflects the "dark knowledge" inside the teacher model, including the similarity relationship between categories and the uncertainty of the model.

  2. Provide richer learning signals : The student model learns not only "what" but also "how similar to what".

  3. Improved generalization : By learning similarities between categories, the student model can better handle boundary cases and unseen samples.

After understanding the core value of soft labels, the next question is how to actually transfer this rich knowledge to the student model. Although soft labels contain valuable dark knowledge, it is still necessary to design a suitable training framework to ensure that the student model can effectively absorb this information. Below we will explore the specific training methods of the student model in knowledge distillation and see how to fully utilize the advantages of soft labels in practice.

Student model training method

The training process of the student model combines multiple learning objectives, as shown in the following figure:

[Figure: Student model training method]

Dual input sources

The student model (the model labeled "to be trained" in the figure) receives two kinds of input simultaneously during training:

  1. Training data : the original training samples, such as the image data shown on the left side of the figure

  2. Distilled knowledge : the teacher model's "soft label" predictions (see the sketch below)
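A minimal sketch of how these two input sources might be wired together in a single training step, assuming a PyTorch setup. The names `teacher`, `student`, `images`, and `hard_labels` are illustrative placeholders, not identifiers from this article; the frozen teacher is run over the same batch to produce soft labels, while the hard labels come straight from the dataset.

```python
import torch

def distillation_forward(teacher, student, images, hard_labels):
    """One forward pass with both input sources of knowledge distillation.

    `teacher` and `student` are assumed to be classifiers over the same
    categories; `images` and `hard_labels` come from the original dataset.
    """
    with torch.no_grad():                 # the teacher is frozen, never updated
        teacher_logits = teacher(images)  # source 2: distilled knowledge (soft labels)
    student_logits = student(images)      # source 1: learning from the training data itself
    return student_logits, teacher_logits, hard_labels
```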

Dual learning objectives

The training process of the student model combines two goals:

  1. Hard label learning : The student model must learn the correct classification from the true labels ("actual labels" in the figure) of the original training data. This part is usually computed with the standard cross-entropy loss to ensure that the model makes accurate predictions.

  2. Soft label learning : The student model also learns from the teacher model's output probability distribution ("prediction value" in the figure). These soft labels encode the teacher's confidence in each category and thus rich information such as the similarity relationships between categories. This part typically uses the KL divergence to measure the difference between the student's and the teacher's output distributions; a sketch combining both objectives follows below.
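Here is one common way the two objectives are combined in code. This is a sketch, not the article's reference implementation: it assumes a PyTorch classifier, and the temperature `T` and mixing weight `alpha` are standard distillation hyperparameters whose values here are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    # Objective 1: hard-label learning against the ground-truth classes
    # (hard_labels are class indices, e.g. 0 for "cat").
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # Objective 2: soft-label learning against the teacher's distribution.
    # Both sides are softened with the same temperature T; the conventional
    # T**2 factor keeps the gradient scale comparable to the hard loss.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Weighted combination of the two learning objectives.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

In practice, `alpha` controls how much the student trusts the ground-truth labels versus the teacher's soft labels, and the loss is backpropagated through the student only, since the teacher's outputs are treated as fixed targets.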

Summary

Soft labels replace traditional hard labels and record the teacher model's probability predictions for all categories, rather than just marking the correct category. This probability distribution contains rich "dark knowledge", especially the similarity relationship between categories.

During the knowledge distillation process, the student model learns from both the original data and the soft labels of the teacher model, and obtains stronger generalization capabilities by combining the dual learning objectives of hard labels and soft labels. This method significantly reduces computing requirements while maintaining model performance, allowing complex AI capabilities to be deployed and applied in resource-constrained scenarios such as mobile devices.