Training a Qwen base model with DeepSeek-R1's GRPO on a single RTX 4090: the complete run

Learn how to train large AI models efficiently on a single RTX 4090 GPU.
Core content:
1. LoRA fine-tuning of the DeepSeek-R1-32B model on a single 4090
2. The key idea behind training the Qwen2.5 base model with GRPO on a single 4090
3. The complete steps and performance test results for training the Qwen base model
Both approaches are simple and direct. Here is a brief explanation:
A: LoRA fine-tuning of DeepSeek-R1, based on the DeepSeek-R1-Distill-Qwen-32B model. LoRA is one method of PEFT (Parameter-Efficient Fine-Tuning). Simply put, it freezes the model's original weights and trains only a small set of added low-rank weights on a domain-specific dataset to improve results in that domain. The advantage is that it saves resources: with Unsloth's optimizations and int4 quantization, a 4090 with only 24G of video memory can train a model whose weight file is 62G.
B: GRPO training of a reasoning model, starting from Qwen's original base model Qwen2.5-3B. It applies the key method of DeepSeek-R1: reinforcement learning with simple reward functions plus the GRPO algorithm, to give the model better reasoning ability. In this process all weights of the original model take part in training, so the demand for video memory and compute is higher. Even with all of Unsloth's optimizations stacked together, a 3B model is the largest that can be trained. In actual tests the 7B model runs out of video memory, and this is still being optimized.
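The core of GRPO mentioned in B can be illustrated in a few lines of Python. For each question the model samples a group of completions, each completion is scored by the reward functions, and a completion's advantage is its reward measured against the rest of its own group. This is a minimal sketch, not the trainer's actual code: the function name, the use of the sample standard deviation, and the small eps term are assumptions. The six rewards are hypothetical values, chosen so that their mean and standard deviation match the reward and reward_std reported for the last step in the training end log.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to the other completions of the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: six completions for one question,
# five with the right answer (3.5 total reward) and one without (1.5).
rewards = [3.5, 3.5, 3.5, 3.5, 3.5, 1.5]
advantages = group_relative_advantages(rewards)

print(round(mean(rewards), 4))   # 3.1667
print(round(stdev(rewards), 4))  # 0.8165
print(advantages)                # five positive values, one negative value
```

Completions with a positive advantage are made more likely and those with a negative advantage less likely, which is why no separate value model is needed.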
The previous article used a single RTX 4090 to train the Qwen2.5 base model with GRPO and reproduce the key ideas of DeepSeek-R1. To finish that test quickly, the maximum number of training steps, max_steps, was set to 250. For this run the max_steps setting was removed, so the trainer calculated the number of steps from the amount of data. The results of the complete run are as follows:
Total steps: 22419
Total training epochs: 3.0
Training time: 17.3 hours in total (62352.0686 seconds)
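These totals can be cross-checked against the figures printed in the training logs (7,473 examples, 3 epochs, total batch size 1). The snippet below is only an arithmetic check; the variable names are illustrative and the step formula assumes one optimizer step per batch.

```python
import math

num_examples = 7473           # "Num examples" in the training start log
num_epochs = 3                # "Num Epochs"
total_batch_size = 1          # "Total batch size"
train_runtime_s = 62352.0686  # "train_runtime" in the training end log

steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs
hours = train_runtime_s / 3600
seconds_per_step = train_runtime_s / total_steps
steps_per_second = total_steps / train_runtime_s

print(total_steps)                 # 22419
print(round(hours, 1))             # 17.3
print(round(seconds_per_step, 2))  # 2.78
print(round(steps_per_second, 2))  # 0.36
```

The last value agrees with the train_steps_per_second reported at the end of training.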
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
| 30%   56C    P2             251W / 450W |  18142MiB / 24564MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
Training start log:
INFO 02-18 09:44:59 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 02-18 09:44:59 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 09:45:00 worker.py:267] Memory profiling takes 1.43 seconds
INFO 02-18 09:45:00 worker.py:267] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.59) = 13.96GiB
INFO 02-18 09:45:00 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 6.89GiB.
INFO 02-18 09:45:01 executor_base.py:110] # CUDA blocks: 12541, # CPU blocks: 10922
INFO 02-18 09:45:01 executor_base.py:115] Maximum concurrency for 512 tokens per request: 391.91x
INFO 02-18 09:45:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:21<00:00, 1.45it/s]
INFO 02-18 09:45:26 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.15 GiB
INFO 02-18 09:45:26 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 27.19 seconds
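The memory figures in the vLLM lines above fit together arithmetically, which helps when tuning gpu_memory_utilization on a 24G card. The following check uses only numbers printed in the log, with one assumption: a KV-cache block size of 16 tokens, which is vLLM's default on CUDA. The variable names are illustrative.

```python
total_gpu_memory_gib = 23.65
gpu_memory_utilization = 0.59
weights_gib = 5.77
non_torch_gib = 0.08
activation_peak_gib = 1.23

# Memory that vLLM may use, and what is left for the KV cache.
budget_gib = total_gpu_memory_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

num_gpu_blocks = 12541
block_size_tokens = 16   # assumption: vLLM default block size
tokens_per_request = 512
max_concurrency = num_gpu_blocks * block_size_tokens / tokens_per_request

print(round(budget_gib, 2))       # 13.95 (log: 13.96)
print(round(kv_cache_gib, 2))     # 6.87 (log: 6.89)
print(round(max_concurrency, 2))  # 391.91
```

The small differences from the logged 13.96 and 6.89 are within the rounding of the printed inputs, and the concurrency figure matches the log exactly.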
Unsloth 2025.2.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 22,419
 "-____-"     Number of trainable parameters = 59,867,136
  0%| | 5/22419 [00:14<18:12:54, 2.93s/it]
----------------------------- Question:
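The trainable parameter count in the log, 59,867,136, can be reproduced exactly by one plausible adapter configuration: rank-32 LoRA adapters on the seven projection matrices of each of the 36 patched layers. The sketch below is a back-of-the-envelope check, not a statement of the settings actually used. The layer dimensions are assumed from Qwen2.5-3B's published configuration, and the rank is a hypothetical value that happens to fit.

```python
# Assumed Qwen2.5-3B dimensions (from its published config, not from this article).
hidden = 2048
intermediate = 11008
num_layers = 36    # matches "patched 36 layers" in the log
kv_dim = 2 * 128   # 2 key/value heads x head size 128
rank = 32          # hypothetical LoRA rank

# (input size, output size) of each adapted projection in one layer.
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so an adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in projections.values())
total = per_layer * num_layers
print(total)  # 59867136
```

Under these assumptions the trained weights amount to roughly 2% of a 3B-parameter model.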
Training end log:
s/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.25, 'rewards/int_reward_func': 0.25, 'rewards/correctness_reward_func': 1.0, 'reward': 1.7916667461395264, 'reward_std': 1.8719420433044434, 'kl': 0.4413377642631531, 'epoch': 3.0}
-------------------- Question:
Nellie had 380 legos, but she lost 57 of them and gave her sister 24 legos. How many legos does she have now?
Answer:
299
Response:
<reasoning>
Nellie had 380 legos initially. She lost 57 legos, so she now has 380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has 323 - 24 = 299 legos.
</reasoning>
<answer>
299
</answer>
Extracted:
299
{'loss': 0.0023, 'grad_norm': 0.38597264885902405, 'learning_rate': 0.0, 'completion_length': 87.5, 'rewards/xmlcount_reward_func': 0.5, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.5, 'rewards/int_reward_func': 0.5, 'rewards/correctness_reward_func': 1.6666667461395264, 'reward': 3.1666667461395264, 'reward_std': 0.8164965510368347, 'kl': 0.058432161808013916, 'epoch': 3.0}
{'train_runtime': 62352.0686, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.36, 'train_loss': 0.006079863988740294, 'epoch': 3.0}
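The log reports one score per reward function (strict_format_reward_func, int_reward_func, correctness_reward_func and others) next to the text extracted from the answer tag. The sketch below shows how functions of this kind could work. It is not the author's code: the function names and bodies, the regular expression, and the reward values 0.5 and 2.0 are assumptions, chosen only to be loosely consistent with the magnitudes in the log.

```python
import re

def extract_xml_answer(text):
    """Return the text between <answer> and </answer>, stripped of whitespace."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Strict layout: reasoning block, then answer block, each tag on its own line.
STRICT_PATTERN = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$", re.DOTALL
)

def strict_format_reward(completion):
    """Reward completions that follow the exact tag layout (value is an assumption)."""
    return 0.5 if STRICT_PATTERN.match(completion) else 0.0

def int_reward(completion):
    """Reward completions whose extracted answer is a plain integer."""
    return 0.5 if extract_xml_answer(completion).isdigit() else 0.0

def correctness_reward(completion, reference):
    """Reward completions whose extracted answer equals the reference answer."""
    return 2.0 if extract_xml_answer(completion) == reference else 0.0

# The response shown in the training end log.
completion = (
    "<reasoning>\n"
    "Nellie had 380 legos initially. She lost 57 legos, so she now has "
    "380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has "
    "323 - 24 = 299 legos.\n"
    "</reasoning>\n"
    "<answer>\n"
    "299\n"
    "</answer>\n"
)

print(extract_xml_answer(completion))         # 299
print(strict_format_reward(completion))       # 0.5
print(int_reward(completion))                 # 0.5
print(correctness_reward(completion, "299"))  # 2.0
```

Because each function only inspects the generated text, no learned reward model is required, which keeps the memory footprint small on a single card.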
Test Question 1: Calculate pi
Original model's answer:
Here is a possible way to calculate pi using a Monte Carlo simulation. Let's say we want to estimate pi with an accuracy of 5 decimal places (0.00001). We can do this by generating a large number of random points in the range [-1, 1] for both x and y coordinates. We can then count the number of points that fall within the circle of radius 1 centered at the origin (ie, points that satisfy x^2 + y^2 < 1). The ratio of the number of points that fall within the circle to the total number of points will be an estimate of pi/4. To get an estimate of pi with an accuracy of 5 decimal places, we would need to generate approximately 10^8 random points. Here is some sample Python code that implements this method:
import random
n = 10**8  # Number of random points to generate
count = 0  # Number of points that fall within the circle
for i in range(n):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if x**2 + y**2 < 1:
        count += 1
pi = count / n * 4
print("pi = %f" % pi)
Trained model's answer:
<reasoning>
We can use the Monte Carlo method to approximate pi. In this method, we generate random points within a square and count how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an approximation of pi.
</reasoning>
<answer>
3.14
</answer>
As the two answers show, the trained model's answer follows the required format and is relatively concise, and the model can already reason.
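The sample-size claim in the first answer to Test Question 1 (about 10^8 points for five decimal places) can be checked with the standard error of the Monte Carlo estimate. This is a minimal sketch with illustrative function names and a fixed seed. It uses the fact that each point lands inside the circle with probability pi/4, so the hit count is binomial.

```python
import math
import random

def estimate_pi(n, seed=0):
    """Estimate pi from n uniform points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y < 1:
            inside += 1
    return 4 * inside / n

def standard_error(n):
    """Standard error of the estimate: 4 * sqrt(p * (1 - p) / n) with p = pi / 4."""
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / n)

estimate = estimate_pi(100_000)
print(estimate)
print(f"{standard_error(10**8):.1e}")  # 1.6e-04
```

With 10^8 points the typical error is about 1.6e-4, so the estimate is reliable to roughly three decimal places. Reaching an error of 1e-5 would take on the order of 10^10 points.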
Test Question 2: Compare 9.919 and 9.92
Original model's answer:
9.919 is smaller than 9.92.
】,【】,【】,【】,【】,【】,【】,...
(This time the original model answered correctly. The empty brackets at the end are presumably filler, and because it is not an instruct model, the format is a bit messy.)
Trained model's answer:
<reasoning>
9.919 is equivalent to 9.91900, while 9.92 is equivalent to 9.92000. Since 9.92000 is greater than 9.91900, 9.92 is bigger.
</reasoning>
<answer>
9.92
</answer>