Reasoning model distillation practice: using large models to improve small model capabilities

Model distillation lets a small model absorb the essential knowledge of a large model!
Core content:
1. Introduction to model distillation: how to make a small model learn what a large model knows
2. A three-step method: construct the distillation data, train the small model, and evaluate the result
3. Python code example: obtain distillation data from a high-quality open-source math dataset
The popularity of DeepSeek-R1 has attracted more developers' attention to model distillation technology, which is the secret to allowing small models to learn the essence of large models. Today, we will use the Qwen2.5-1.5B small model (equivalent to a junior high school student in the AI world) to practice!
What is model distillation?
It works just like an ordinary student learning problem-solving approaches from a top student: the small model learns from the reasoning that the large model writes out.
The three-step method:
1. Create the "Student Notes" (construct the distillation data)
2. Give the small model special training (training phase)
3. Take the acceptance test (model evaluation)
Following this process, even a small model can learn from the real masters! No expensive hardware is needed, and a regular graphics card is enough for training. Give this "special training" secret of the AI world a try!
In order to provide small models with suitable learning materials, we need to obtain distilled data from high-quality open source mathematical datasets, such as AI-MO/NuminaMath-CoT.
AI-MO/NuminaMath-CoT:
https://www.modelscope.cn/datasets/AI-MO/NuminaMath-CoT/summary
The following shows how to use ModelScope's online model inference service (https://www.modelscope.cn/docs/model-service/API-Inference/intro) with DeepSeek-R1 as the teacher model to obtain the solution process and reference answer for a math problem through prompt construction. Here is a Python code example:
from openai import OpenAI
import os

system_prompt = (
    'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. '
    'The assistant first thinks about the reasoning process in the mind and then provides the user '
    'with the answer. The reasoning process and answer are enclosed '
    'within <think> </think> and <answer> </answer> tags, respectively, '
    'i.e., <think> reasoning process here </think> <answer> answer here </answer>.'
)

# Not a raw string, so that \n is a real newline; \\boxed{{}} renders as \boxed{} after .format()
prompt_template = '{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}.'

# Raw string, so that the LaTeX backslashes are kept literally
question = r'Find all real numbers \( x, y, z \) such that \[ x + y + z = 3, \quad x^2 + y^2 + z^2 = 3, \quad x^3 + y^3 + z^3 = 3 \]'

client = OpenAI(
    api_key=os.getenv("MODELSCOPE_SDK_TOKEN"),  # Set this environment variable to your ModelScope SDK Token
    base_url="https://api-inference.modelscope.cn/v1/"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # ModelScope Model-Id
    messages=[
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': prompt_template.format(question=question)}
    ],
    stream=True
)

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # The reasoning process and the final answer arrive in separate fields
    reasoning = getattr(delta, 'reasoning_content', None)
    if reasoning:
        print(reasoning, end='', flush=True)
    if getattr(delta, 'content', None):
        print(delta.content, end='', flush=True)
Output:
Okay, so I need to solve this system of equations:
x + y + z = 3,
x² + y² + z² = 3,
x³ + y³ + z³ = 3.
Hmm, let's see. These are symmetric equations, so maybe I can use some symmetric properties or formulas. Let me recall that for symmetric equations, using variables like sums and products might help. Let me denote S = x + y + z, which is given as 3. Then, the sum of squares is given by x² + y² + z² = 3. I also know that the sum of squares can be related to S and the pairwise products. The formula is:
x² + y² + z² = (x + y + z)² - 2(xy + yz + zx) = 9 - 2(xy + yz + zx).
Let's solve for the sum of products:
2(xy + yz + zx) = 9 - 3 = 6 ⇒ xy + yz + zx = 3.
Okay, so now I have S = 3, and the sum of products is also 3. Now, moving on to the third equation, which is the sum of cubes: x³ + y³ + z³ = 3. There's a formula for the sum of cubes in terms of S, the sum of products, and the product xyz. The formula is:
x³ + y³ + z³ = (x + y + z)³ - 3(x + y + z)(xy + yz + zx) + 3xyz.
Plugging in the known values:
3 = 3³ - 3*3*3 + 3xyz.
Calculating each term:
3³ is 27,
3*3*3 is 27,
So,
3 = 27 - 27 + 3xyz ⇒ 3 = 0 + 3xyz ⇒ 3xyz = 3 ⇒ xyz = 1.
So now I have S = 3, sum of products = 3, and product = 1. These are the elementary symmetric sums. Therefore, the monic cubic polynomial whose roots are x, y, z is:
t³ - S t² + (xy + yz + zx) t - xyz = 0 ⇒ t³ - 3t² + 3t - 1 = 0.
So, solving this cubic equation should give me all real numbers x, y, z. Let me try to factor this cubic. Maybe it has rational roots? By Rational Root Theorem, possible roots are ±1. Let's test t=1:
1³ - 3*1² + 3*1 - 1 = 1 - 3 + 3 - 1 = 0.
Yes, t=1 is a root. So, we can factor out (t - 1). Let's perform polynomial division or use synthetic division.
Dividing t³ - 3t² + 3t - 1 by (t - 1):
Using synthetic division:
1 |  1  -3   3  -1
  |      1  -2   1
  -----------------
     1  -2   1   0
So, the quadratic factor is t² - 2t + 1. Which factors further as (t - 1)².
Therefore, the cubic factors as (t - 1)³ = 0. So, all three roots are 1.
Therefore, the solution is x = y = z = 1.
Wait, but let me check if there could be other solutions. Suppose the cubic factors into (t-1)^3, which means that all three variables are 1. Let me verify if this satisfies the original equations.
First equation: 1 + 1 + 1 = 3, yes.
Second: 1² + 1² + 1² = 3, yes.
Third: 1³ + 1³ + 1³ = 3, yes. So that works.
But wait, maybe there are other real solutions where variables are not all 1? Let's see. Since the cubic equation only has t=1 as a triple root, that's the only solution. Therefore, the only real solution is x=y=z=1.
But let me think again. Suppose that two variables are equal to 1 and the third is different. Let me test that. ...... Alternatively, if one variable is not 1, would that work? Wait, suppose that x, y, z are all 1, which works. Suppose, for example, x=2, y=0, z=1. Then x+y+z=3. Check the sum of squares: 4 + 0 + 1 = 5 ≠ 3. So that doesn't work. Alternatively, maybe some other combination. But given that the cubic equation only has 1 as a root, multiplicity three, that suggests that all three variables must be 1.
Alternatively, perhaps complex roots, but the problem asks for real numbers. So, since the only real root is 1 with multiplicity three, the only real solution is x=y=z=1. Therefore, the answer is all variables equal to 1.
...... = z = 1 \). Thus, the numbers are \(\boxed{1}\), \(\boxed{1}\), and \(\boxed{1}\). ...... \(\boxed{1}\).</answer>
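Since the prompt asks the teacher model to put its final answer within \boxed{}, the answer can be pulled out of a response with a small helper. The following is only a minimal sketch: the function name `extract_boxed` is hypothetical and is not part of ModelScope or ms-swift, and it assumes that the last \boxed{...} in the text holds the final answer.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if there is none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never balanced


print(extract_boxed("Thus, the answer is \\(\\boxed{1}\\)."))
print(extract_boxed("x = \\boxed{\\frac{1}{2}}"))
```

Counting brace depth, rather than using a simple regular expression, keeps nested LaTeX such as \frac{1}{2} intact.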
We have processed a batch of math problems and generated a distilled dataset containing 4,000 samples. Each sample contains the question, the solution process, and the reference answer. We save it in JSONL format for subsequent use. The dataset preview is: https://www.modelscope.cn/datasets/modelscope/MathR/dataPeview.
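As an illustration of the JSONL step, here is a minimal sketch that writes one sample per line and reads the file back. The field names `question`, `solution` and `answer` and the file name `distill.jsonl` are assumptions made for this example; the actual schema of the MathR dataset may differ.

```python
import json
import os
import tempfile


def write_jsonl(samples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample layout: question, solution process, reference answer
samples = [
    {
        "question": "Find all real numbers x, y, z such that ...",
        "solution": "Okay, so I need to solve this system of equations ...",
        "answer": "1",
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "distill.jsonl")
    write_jsonl(samples, path)
    loaded = read_jsonl(path)

print(len(loaded), "sample(s) round-tripped")
```

Writing one object per line means the file can be appended to while a batch is still running, and a single corrupted line does not invalidate the rest.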
Next, we will use the ms-swift ( https://github.com/modelscope/ms-swift) model training framework to train the Qwen2.5-1.5B model with this batch of data.
Example of a training sample:
[
  {
    "role": "user",
    "content": "A set of consecutive positive integers beginning with $1$ is written on a blackboard. One number is erased. The average (arithmetic mean) of the remaining numbers is $35\\frac{7}{17}$. What number was erased? \n$\\textbf{(A)}\\ 6\\qquad \\textbf{(B)}\\ 7 \\qquad \\textbf{(C)}\\ 8 \\qquad \\textbf{(D)}\\ 9\\qquad \\textbf{(E)}\\ \\text{cannot be determined}$\nPlease reason step by step, and put your final answer within \\boxed{}."
  },
  {
    "role": "assistant",
    "content": "\nOkay, let's see. I need to figure out which ....... Answer is B.\n\n**Final Answer**\n\\boxed{B}\n\n\nGiven a set of consecutive positive integers starting ...... Average of remaining numbers: \\(\\frac{2408}{68} = \\frac{602}{17} = 35 \\frac{7}{17}\\)\n\nThus, the number erased is \\(\\boxed{B}\\)."
  }
]
Note: Because GPU memory is limited, we fine-tune Qwen2.5-1.5B with LoRA. LoRA is an efficient fine-tuning method that adapts the model by adding trainable low-rank matrices while leaving the original model parameters unchanged, which greatly reduces training cost and time. If you have a more powerful graphics card, consider using more training data and full-parameter fine-tuning.
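To give a rough sense of why low-rank matrices are cheap to train, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. The rank of 16 matches the --lora_rank used in this tutorial, but the 1536 x 1536 matrix size is only an illustrative assumption, and the function name is hypothetical.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA trains two factors, A (rank x d_in) and B (d_out x rank),
    instead of the full matrix.
    """
    return rank * d_in + d_out * rank


d = 1536   # illustrative layer width (assumption)
rank = 16  # same value as --lora_rank in the training command

full = d * d
lora = lora_param_count(d, d, rank)
print(f"full matrix: {full} parameters")
print(f"LoRA factors: {lora} parameters ({full // lora}x fewer)")
```

Only the two small factors receive gradients and optimizer state, which is where most of the memory saving comes from.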
In the following command, we also use SwanLab (https://github.com/SwanHubX/SwanLab) to visualize the training process, which makes it easy to track metrics such as the loss during training. Replace YOUR_SWANLAB_TOKEN in the command with your own token.
!CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --train_type lora \
    --lora_rank 16 \
    --torch_dtype bfloat16 \
    --dataset 'modelscope/MathR:clean' \
    --split_dataset_ratio 0 \
    --max_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 16 \
    --save_steps 100 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --report_to swanlab \
    --swanlab_token YOUR_SWANLAB_TOKEN \
    --swanlab_mode cloud
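The batch-related flags in this command interact: with --per_device_train_batch_size 1 and --gradient_accumulation_steps 16 on one GPU, each optimizer update sees 16 samples. The sketch below works through that arithmetic. The helper names are hypothetical, the sample count of 4,000 is just the size quoted earlier (the `clean` subset used in the command may be smaller), and the exact rounding of the step count depends on the trainer.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_samples, per_device_batch, grad_accum_steps,
                           num_devices=1, epochs=1):
    """Approximate number of optimizer updates (rounding up per epoch)."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return epochs * math.ceil(num_samples / batch)


batch = effective_batch_size(per_device_batch=1, grad_accum_steps=16)
steps = approx_optimizer_steps(num_samples=4000, per_device_batch=1,
                               grad_accum_steps=16)
print(f"effective batch size: {batch}")
print(f"approximate optimizer steps for 4000 samples: {steps}")
```

Gradient accumulation trades time for memory: only one sample is held on the GPU at a time, while the update still averages over 16.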
Run the following command in the console to chat with the trained model and get a feel for how it performs.
Note:
--adapters: replace the value with the path of the model you trained.
--stream true: use streaming inference.
--infer_backend pt: use PyTorch as the inference backend.
--temperature 0: introduce no randomness.
--max_new_tokens 2048: generate at most 2048 tokens.
swift infer \
    --adapters 'output/Qwen2.5-1.5B-Instruct/v11-20250415-120200/checkpoint-81' \
    --stream true \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048
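To illustrate why a temperature of 0 introduces no randomness, here is a toy sketch of token selection. It is not how ms-swift implements decoding: the function `pick_token` and the logits are made up for the example. At temperature 0 it takes the highest-scoring token (greedy search); otherwise it samples from the temperature-scaled softmax.

```python
import math
import random


def pick_token(logits, temperature, rng=random):
    """Pick a token index from raw scores (toy example)."""
    if temperature == 0:
        # Greedy search: always the highest-scoring token, fully deterministic
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]  # made-up scores for three candidate tokens
greedy = [pick_token(logits, 0) for _ in range(5)]
sampled = [pick_token(logits, 1.0, random.Random(0)) for _ in range(5)]
print("temperature 0:", greedy)
print("temperature 1:", sampled)
```

Deterministic decoding is convenient when comparing two models, because a difference in the output then comes from the model and not from sampling noise.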
After training, we evaluate the model on a new set of math problems. Here we use the gsm8k dataset (a math problem dataset) for evaluation. You can view the dataset here (https://www.modelscope.cn/datasets/modelscope/gsm8k/dataPeview).
from evalscope import run_task, TaskConfig

task_config = TaskConfig(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # Original model
    datasets=["gsm8k"],  # Dataset name
    dataset_args={
        "gsm8k": {"few_shot_num": 0},  # few_shot_num: 0 means no few-shot examples
    },
    generation_config={
        "max_new_tokens": 4096,  # Maximum number of tokens generated
        "temperature": 0,  # Sampling temperature; 0 means greedy search
    },
    eval_batch_size=10,  # Batch size during evaluation
    limit=100,  # Size of the evaluation set: only the first 100 samples are evaluated
)
run_task(task_config)
The results are as follows:
To evaluate the trained model, first run the following command to merge the trained LoRA parameters back into the original model Qwen2.5-1.5B-Instruct. This produces a new model, which is saved to the checkpoint-xxx-merged directory.
!swift export \
    --adapters /mnt/data/data/user/maoyunlin.myl/tools/course/distill/output/Qwen2.5-1.5B-Instruct/v11-20250415-120200/checkpoint-81 \
    --merge_lora true
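Conceptually, merging folds the low-rank update into the base weight, so the merged model gives the same output as the base model plus adapter without needing the adapter at inference time. The toy sketch below checks this on a 2 x 2 matrix in pure Python. All of the values, including the scaling factor, are made up for illustration, and this is not the code that swift export runs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[0.5], [1.0]]            # LoRA factor B (2 x 1), rank 1
A = [[2.0, -1.0]]             # LoRA factor A (1 x 2)
scale = 2.0                   # LoRA scaling factor (made-up value)
x = [[1.0], [1.0]]            # input column vector

# Adapter kept separate: y = W x + scale * B (A x)
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
y_adapter = [[base_out[i][0] + scale * lora_out[i][0]] for i in range(2)]

# Adapter merged: W_merged = W + scale * B A, then y = W_merged x
delta = matmul(B, A)
W_merged = [[W[i][j] + scale * delta[i][j] for j in range(2)] for i in range(2)]
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Because the two outputs agree, the merged checkpoint can be passed to an evaluation tool as an ordinary model path.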
# Test the model after distillation training
from evalscope import run_task, TaskConfig

# Remember to replace the model path below
task_config = TaskConfig(
    model="/mnt/data/data/user/maoyunlin.myl/tools/course/distill/output/Qwen2.5-1.5B-Instruct/v11-20250415-120200/checkpoint-81-merged",
    datasets=["gsm8k"],
    dataset_args={
        "gsm8k": {"few_shot_num": 0},
    },
    generation_config={
        "max_new_tokens": 4096,
        "temperature": 0,
    },
    eval_batch_size=10,
    limit=100,
)
run_task(task_config)
The results are as follows:
Visualize the results
The evaluation results show that the model's answer accuracy on the 100 gsm8k test samples has increased by 12% after distillation training, which is a significant improvement. We can also use visualization tools to analyze the model's reasoning process and better understand its decision logic.
import os

os.environ['GRADIO_ROOT_PATH'] = f"/{os.environ['JUPYTER_NAME']}/proxy/7860"
print(os.environ['GRADIO_ROOT_PATH'])
!evalscope app
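For a rough cross-check of an accuracy number outside the evaluation framework, answers can be compared by hand. The sketch below is not EvalScope's scoring logic. It relies on the gsm8k convention that the reference answer ends with "#### <number>", and it assumes that the last number in the model's output is its final answer. The function names and the two examples are made up.

```python
import re


def reference_number(answer_text):
    """gsm8k reference answers end with '#### <number>'."""
    return float(answer_text.split("####")[-1].strip().replace(",", ""))


def last_number(text):
    """Take the last number in the model output as its final answer (assumption)."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return float(numbers[-1].replace(",", "")) if numbers else None


def accuracy(predictions, references):
    """Fraction of predictions whose final number matches the reference."""
    correct = sum(last_number(p) == reference_number(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)


predictions = ["16 - 3 - 4 = 9, and 9 * 2 = \\boxed{18}", "I think it is 7."]
references = ["She makes 9 * 2 = 18 dollars.\n#### 18", "He has 5 left.\n#### 5"]
print(accuracy(predictions, references))
```

Running the same check on the outputs of the original and the distilled model gives two accuracies that can be compared directly.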
In this tutorial, we walked through the complete process of using a teacher model to distill a small model, covering three key stages: data construction, model training, and model evaluation. With this method, you can efficiently train your own small model. I hope this tutorial helps you master the technique and apply it flexibly in real projects!