Understanding DeepSeek in one article: Qwen-1.5B distilled from R1

Written by
Caleb Hayes
Updated on: July 16, 2025
Recommendation

Explore the cutting-edge technique of knowledge distillation and learn how DeepSeek migrates the powerful reasoning capability of R1 into Qwen-1.5B, achieving a double leap in performance and efficiency.

Core content:
1. Knowledge distillation technology and its application in AI model optimization
2. How DeepSeek technology achieves knowledge migration from R1 to Qwen-1.5B
3. Specific steps of the distillation process and comparative analysis of model performance


Knowledge distillation is a technique that transfers the knowledge of a complex large model (the teacher model) into a smaller model (the student model). In this process, the reasoning ability and knowledge of the teacher model are refined and transferred to the student model, so that the student model maintains high performance while requiring far less computation and fewer resources.
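To make this concrete, below is a minimal sketch of the classic soft-target distillation loss (in the spirit of Hinton et al.), written in PyTorch. It illustrates the general technique only; the function name, temperature, and weighting are illustrative choices, not DeepSeek's actual implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic knowledge-distillation loss (illustrative sketch).

    student_logits, teacher_logits: (batch, vocab) tensors
    labels: (batch,) ground-truth token ids
    T: temperature used to soften both distributions
    alpha: weight between the hard-label loss and the soft-target loss
    """
    # Hard-label cross-entropy against the ground truth
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft targets: temperature-softened teacher distribution vs. student
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * (T * T)

    return alpha * hard_loss + (1 - alpha) * soft_loss
```

The temperature T flattens both distributions so that the student can also learn from the probabilities the teacher assigns to non-target tokens, not just the single correct answer.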

(Related: Graphical Deep Learning - Data Distillation and Knowledge Distillation)

Through an innovative distillation technique (carefully prepared data, an effective distillation method, and model fine-tuning and optimization), DeepSeek successfully distilled the capabilities of R1 into Qwen-1.5B, giving Qwen-1.5B capabilities comparable to o1-mini. This achievement brings new thinking and inspiration to the future development of AI technology.

Distilling R1 into Qwen-1.5B proceeds in two stages: preparation and distillation. In the preparation stage, the teacher and student models are selected; in the distillation stage, the teacher's knowledge is refined into the student model, reducing computational cost while preserving capability.

1. Preparation stage

How to choose the teacher model and the student model? The preparation stage is mainly about selecting and designing the models: a large neural network with excellent performance is selected as the teacher model, and a small neural network with a relatively simple structure is designed as the student model. A loading sketch follows the two bullets below.
  • Teacher model: DeepSeek-R1, a powerful reasoning model trained with large-scale reinforcement learning, which performs well on reasoning tasks such as mathematics and programming.

  • Student model: Qwen-1.5B, a model with fewer parameters and lower computational resource requirements, which needs to learn the reasoning ability of R1 through the distillation process.
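In practice, the two models are simply two checkpoints loaded side by side, as in the hedged sketch below using Hugging Face transformers. The repository names are assumptions about which checkpoints one might pick; hosting the full R1 teacher locally is rarely practical, and in many pipelines the "teacher" is effectively its pre-generated outputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Teacher: the large reasoning model (assumed repo id; the full R1 is very
# large, so in practice its generated outputs often stand in for the model).
teacher_name = "deepseek-ai/DeepSeek-R1"
# Student: a small Qwen base model to be distilled (assumed repo id).
student_name = "Qwen/Qwen2.5-1.5B"

teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

student_tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, device_map="auto")
```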

How to construct the distillation dataset? Choose a dataset that is similar or related to the one used to train the R1 model. It should contain enough samples to cover the tasks and scenarios in which R1 excels; the original training dataset, or a subset of it, can serve as the distillation dataset. A minimal sketch of assembling such a dataset is shown below.
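One common way to build such a dataset is to let the teacher generate full reasoning traces for a pool of prompts and store the prompt-response pairs for supervised training of the student. The sketch below is an illustration under that assumption; the prompt source, generation settings, and JSONL format are placeholders rather than DeepSeek's actual data pipeline.

```python
import json

def build_distillation_dataset(prompts, teacher, tokenizer, out_path,
                               max_new_tokens=2048):
    """Generate teacher reasoning traces and save them as a JSONL dataset."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
            output_ids = teacher.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,  # sampling temperature for trace diversity (illustrative)
            )
            # Keep only the newly generated tokens as the teacher's answer
            answer = tokenizer.decode(
                output_ids[0][inputs["input_ids"].shape[1]:],
                skip_special_tokens=True,
            )
            f.write(json.dumps({"prompt": prompt, "response": answer},
                               ensure_ascii=False) + "\n")
```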
By distilling the outputs of DeepSeek-R1, the efficient DeepSeek-R1-Distill-Qwen-7B comprehensively surpasses non-reasoning models such as GPT-4o-0513. DeepSeek-R1-Distill-Qwen-14B surpasses QwQ-32B-Preview on all evaluation metrics, while the 32B and 70B distilled models significantly surpass o1-mini on most benchmarks. These results demonstrate the powerful potential of distillation.

2. Distillation stage

How to perform knowledge distillation? In the distillation stage, knowledge is extracted from the teacher model and transferred to the student model by softening the teacher's outputs, training the student model on them, and then fine-tuning and optimizing, reducing computational cost while maintaining high performance. A minimal training-loop sketch follows the three steps below.
  1. Soften the teacher model's output: use a temperature parameter to soften the teacher's output distribution, making it smoother (less sharply peaked). The softened distribution reveals the relative probabilities the teacher assigns to alternative outputs, giving the student model richer information to learn from.
  2. Train the student model: use the distillation dataset, with the softened teacher outputs as training targets, to train the student model. During training, the student's performance is optimized by tuning the loss function, learning rate, and other hyperparameters.
  3. Evaluation and fine-tuning: regularly evaluate the student model during training and fine-tune it based on the evaluation results. This helps the student model better adapt to the requirements of the distillation task and raises its performance.
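Putting the three steps together, a minimal logit-distillation training loop might look like the sketch below. It assumes the teacher and student share a tokenizer and vocabulary and that each batch provides a "text" field (prompt plus teacher response); both are simplifications for illustration, not DeepSeek's training code.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train_student(student, teacher, dataset, tokenizer,
                  epochs=1, lr=1e-5, T=2.0, alpha=0.5):
    """Steps 1-3 above: soften teacher outputs, train the student, evaluate."""
    teacher.eval()
    student.train()
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)

    for _ in range(epochs):
        for batch in loader:
            enc = tokenizer(batch["text"], return_tensors="pt",
                            padding=True, truncation=True).to(student.device)

            with torch.no_grad():                      # the teacher is frozen
                teacher_logits = teacher(**enc).logits
            student_logits = student(**enc).logits

            # Shift for next-token prediction and flatten over the sequence
            s = student_logits[:, :-1].reshape(-1, student_logits.size(-1))
            t = teacher_logits[:, :-1].reshape(-1, teacher_logits.size(-1))
            labels = enc["input_ids"][:, 1:].reshape(-1)

            # Step 1: temperature-softened teacher distribution (soft targets)
            soft_loss = F.kl_div(F.log_softmax(s / T, dim=-1),
                                 F.softmax(t / T, dim=-1),
                                 reduction="batchmean") * (T * T)
            # Hard-label loss against the ground-truth next tokens
            hard_loss = F.cross_entropy(s, labels)

            # Step 2: optimize the weighted combination
            loss = alpha * hard_loss + (1 - alpha) * soft_loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Step 3: evaluate on held-out reasoning tasks here and adjust
        # hyper-parameters (T, alpha, lr) based on the results.
```

In practice, much of this can also be carried out as plain supervised fine-tuning on teacher-generated reasoning traces, with the softened-logit term as an optional addition.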

What is DeepSeek's distillation system? DeepSeek's distillation system comes in two forms: progressive hierarchical distillation and two-stage distillation. Progressive hierarchical distillation works at three levels (structure, feature, and logic) to migrate attention patterns, align hidden-layer representations, and optimize decision paths. Two-stage distillation first extracts reasoning capability from the teacher model and then encapsulates it in the student model. In addition, reinforcement learning is used to learn from and correct errors during distillation, improving reasoning capability.

  1. Progressive hierarchical distillation system: DeepSeek innovatively proposed this system, breaking through the traditional single-stage distillation paradigm. It builds a three-level distillation system of structural distillation, feature distillation, and logical distillation, which respectively migrate attention patterns, align hidden-layer representations, and optimize decision paths (a sketch of these three loss terms follows this list).

  2. Two-stage distillation method: divided into a teacher-model stage and a student-model stage. In the teacher-model stage, the reasoning ability of R1 is extracted; in the student-model stage, the reasoning process is encapsulated into Qwen-1.5B through an attention-alignment loss and output-distribution matching.

  3. Reinforcement learning training: DeepSeek innovated the way reasoning models are trained, using a reinforcement learning (RL) strategy rather than relying solely on traditional supervised fine-tuning. This helps the model continuously learn and correct errors during the distillation process, thereby improving its reasoning capability.
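To make the structure / feature / logic levels above more concrete, the sketch below expresses them as three loss terms: an attention-pattern loss, a hidden-state (feature) alignment loss through a learned projection, and a temperature-softened output-distribution loss. This is one possible reading of the description above, not DeepSeek's published implementation; the layer pairing, projection, loss weights, and the assumption of a shared tokenizer and sequence alignment are all illustrative.

```python
import torch
import torch.nn.functional as F

def hierarchical_distill_loss(student_out, teacher_out, proj, T=2.0,
                              w_attn=0.3, w_feat=0.3, w_logit=0.4):
    """Combine structural (attention), feature (hidden-state), and logical
    (output-distribution) distillation terms into one loss.

    student_out / teacher_out: outputs of forward passes run with
        output_attentions=True and output_hidden_states=True
    proj: an nn.Linear mapping the student hidden size to the teacher's
    """
    # 1. Structural distillation: match last-layer attention patterns
    #    (average over heads so differing head counts are not an issue)
    s_attn = student_out.attentions[-1].mean(dim=1)
    t_attn = teacher_out.attentions[-1].mean(dim=1)
    attn_loss = F.mse_loss(s_attn, t_attn)

    # 2. Feature distillation: align last hidden states via the projection
    s_hidden = proj(student_out.hidden_states[-1])
    t_hidden = teacher_out.hidden_states[-1]
    feat_loss = F.mse_loss(s_hidden, t_hidden)

    # 3. Logical distillation: match temperature-softened output distributions
    logit_loss = F.kl_div(
        F.log_softmax(student_out.logits / T, dim=-1),
        F.softmax(teacher_out.logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    return w_attn * attn_loss + w_feat * feat_loss + w_logit * logit_loss
```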