Training the Qwen base model with GRPO (the DeepSeek-R1 method) on a single 4090 GPU: complete steps

Written by
Jasper Cole
Updated on: July 15, 2025
Recommendation

Master the new method of efficiently training large AI models with a single 4090 GPU.

Core content:
1. The LoRA method for fine-tuning the DeepSeek-R1-32B model with a single 4090 GPU
2. The key ideas of training the Qwen2.5 base model with GRPO on a single 4090 GPU
3. The complete steps and performance test results for training the QWen basic model



Both approaches are straightforward. Here is a brief explanation of each:

A: DeepSeek-R1 here refers to the distilled model DeepSeek-R1-Distill-Qwen-32B, which is fine-tuned with LoRA. LoRA is a specific method of PEFT (Parameter-Efficient Fine-Tuning): put simply, it freezes most of the model's weights and updates only a small number of weights using a domain-specific dataset to improve performance in that domain. The advantage is that it saves resources. With Unsloth's optimizations and int4 quantization, a model whose weight files total 62 GB can be trained on a 4090 card with only 24 GB of VRAM.
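
For reference, here is a minimal sketch of what such a setup could look like with Unsloth, loading the model in 4-bit and attaching LoRA adapters. The checkpoint name, rank, and target modules below are illustrative assumptions, not the exact script used in the original training.

# Minimal sketch (illustrative assumptions, not the original script): load a 4-bit quantized
# DeepSeek-R1-Distill-Qwen-32B with Unsloth and attach LoRA adapters so that only a small
# number of weights are trained.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit",  # assumed 4-bit checkpoint name
    max_seq_length=2048,
    load_in_4bit=True,   # int4 quantization keeps the 32B model within 24 GB of VRAM
)

# Freeze the base weights; train only low-rank adapter matrices on the attention/MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank (illustrative)
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)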

B: Starting from Qwen's original base model Qwen2.5-3B, a reasoning model is trained. The key method of DeepSeek-R1 is applied: a simple reward function plus the GRPO algorithm are used for reinforcement learning to give the model better reasoning capabilities. In this process, all of the original model's weights participate in training, so the demand for VRAM and compute is much higher. Even with all of Unsloth's optimizations stacked on top, only a 3B model can be trained this way; in actual tests, the 7B model runs out of VRAM, and further optimization is still in progress.
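
As a rough illustration of the "simple reward function plus GRPO" idea, here is a minimal sketch using TRL's GRPOTrainer. The reward logic, tiny dataset, and hyperparameters are simplified assumptions, not the article's actual training script (the real run uses several reward functions, as the logs below show).

# Minimal sketch (assumptions, not the article's exact script): reward completions whose
# <answer> tag matches the reference answer, and optimize the base model with GRPO.
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny illustrative dataset; the real run uses a GSM8K-style math dataset (7,473 examples).
train_dataset = Dataset.from_list([
    {"prompt": "What is 2 + 3? Put the result inside <answer></answer> tags.", "answer": "5"},
    {"prompt": "What is 10 - 4? Put the result inside <answer></answer> tags.", "answer": "6"},
])

def correctness_reward(prompts, completions, answer, **kwargs):
    """Return 2.0 when the text inside <answer>...</answer> equals the reference answer."""
    rewards = []
    for completion, ref in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        extracted = match.group(1).strip() if match else ""
        rewards.append(2.0 if extracted == str(ref) else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="qwen25-3b-grpo",
    per_device_train_batch_size=4,   # effective batch size must be divisible by num_generations
    num_generations=4,               # GRPO scores several sampled completions per prompt
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B",         # Qwen base model, updated in full (no LoRA) in this approach
    reward_funcs=[correctness_reward],
    args=config,
    train_dataset=train_dataset,
)
trainer.train()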


In the previous article, a single 4090 GPU was used to train the Qwen2.5 base model with GRPO to reproduce the key ideas of DeepSeek-R1. To finish that test quickly, the maximum number of training steps (max_steps) was set to 250. This time, the max_steps setting was removed and the trainer was allowed to calculate the number of steps from the amount of data. After a complete run, the results are as follows (see the config sketch after the list):


  • Total steps: 22419

  • Total training epochs: 3.0 

  • Training time: 17.3 hours in total (62352.0686 seconds)
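
The difference between the two runs comes down to one training argument. A minimal sketch, assuming a TRL-style GRPOConfig (which inherits the standard max_steps and num_train_epochs fields from transformers' TrainingArguments):

from trl import GRPOConfig

# Quick test from the previous article: stop after 250 optimizer steps.
quick_config = GRPOConfig(output_dir="grpo-quick-test", max_steps=250)

# Full run described here: leave max_steps at its default (-1, i.e. unset) and train by epochs,
# letting the trainer derive the total step count from the dataset size.
# With 7,473 examples, an effective batch size of 1, and 3 epochs: 7,473 x 3 = 22,419 steps,
# matching the total reported above.
full_config = GRPOConfig(output_dir="grpo-full-run", num_train_epochs=3)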


Resource peak:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07     CUDA Version: 12.2     |
|-----------------------------------------+----------------------+-----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M.  |
|                                         |                      |                MIG M. |
|=========================================+======================+=======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                   Off |
| 30%   56C    P2            251W / 450W  | 18142MiB / 24564MiB  |      93%      Default |
|                                         |                      |                   N/A |
+-----------------------------------------+----------------------+-----------------------+

Training start log:

INFO 02-18 09:44:59 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 02-18 09:44:59 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 09:45:00 worker.py:267] Memory profiling takes 1.43 seconds
INFO 02-18 09:45:00 worker.py:267] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.59) = 13.96GiB
INFO 02-18 09:45:00 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 6.89GiB.
INFO 02-18 09:45:01 executor_base.py:110] # CUDA blocks: 12541, # CPU blocks: 10922
INFO 02-18 09:45:01 executor_base.py:115] Maximum concurrency for 512 tokens per request: 391.91x
INFO 02-18 09:45:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:21<00:00, 1.45it/s]
INFO 02-18 09:45:26 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.15 GiB
INFO 02-18 09:45:26 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 27.19 seconds
Unsloth 2025.2.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 22,419
 "-____-"     Number of trainable parameters = 59,867,136
  0%| | 5/22419 [00:14<18:12:54, 2.93s/it]
----------------------------- Question:

Training end log:

{..., 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.25, 'rewards/int_reward_func': 0.25, 'rewards/correctness_reward_func': 1.0, 'reward': 1.7916667461395264, 'reward_std': 1.8719420433044434, 'kl': 0.4413377642631531, 'epoch': 3.0}
-------------------- Question:
Nellie had 380 legos, but she lost 57 of them and gave her sister 24 legos. How many legos does she have now?
Answer:
299
Response:
<reasoning>Nellie had 380 legos initially. She lost 57 legos, so she now has 380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has 323 - 24 = 299 legos.</reasoning><answer>299</answer>
Extracted:
299
{'loss': 0.0023, 'grad_norm': 0.38597264885902405, 'learning_rate': 0.0, 'completion_length': 87.5, 'rewards/xmlcount_reward_func': 0.5, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.5, 'rewards/int_reward_func': 0.5, 'rewards/correctness_reward_func': 1.6666667461395264, 'reward': 3.1666667461395264, 'reward_std': 0.8164965510368347, 'kl': 0.058432161808013916, 'epoch': 3.0}
{'train_runtime': 62352.0686, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.36, 'train_loss': 0.006079863988740294, 'epoch': 3.0}

Test Question 1: Calculate pi

Original model answer:
Here is a possible way to calculate pi using a Monte Carlo simulation. Let's say we want to estimate pi with an accuracy of 5 decimal places (0.00001). We can do this by generating a large number of random points in the range [-1, 1] for both x and y coordinates. We can then count the number of points that fall within the circle of radius 1 centered at the origin (ie, points that satisfy x^2 + y^2 < 1). The ratio of the number of points that fall within the circle to the total number of points will be an estimate of pi/4. To get an estimate of pi with an accuracy of 5 decimal places, we would need to generate approximately 10^8 random points. Here is some sample Python code that implements this method:
import random

n = 10**8  # Number of random points to generate
count = 0  # Number of points that fall within the circle
for i in range(n):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if x**2 + y**2 < 1:
        count += 1
pi = count / n * 4
print("pi = %f" % pi)
This code generates 10^8 random points and counts the number of points that fall within the circle. It then calculates an estimate of pi and prints the result. Note that this method is not particularly fast or efficient, and other methods (such as the Chudnovsky algorithm) exist for computing pi with greater accuracy and speed.
The answer of the trained model:

<reasoning>

We can use the Monte Carlo method to approximate pi. In this method, we generate random points within a square and count how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an approximation of pi.

</reasoning>

<answer>

3.14

</answer>

It can be seen that the trained model's answer follows the required format and is relatively concise, and the model already shows the ability to reason.
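
Since the format is fixed, checking it can be done mechanically. Here is a small sketch of how the <reasoning>/<answer> blocks might be parsed; the helper name is hypothetical and not taken from the article's code:

import re

def split_reasoning_and_answer(text: str):
    """Hypothetical helper: pull the reasoning and the answer out of the trained model's output."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answer.group(1).strip() if answer else None,
    )

print(split_reasoning_and_answer(
    "<reasoning>The quarter-circle ratio times 4 approximates pi.</reasoning><answer>3.14</answer>"
))  # -> ('The quarter-circle ratio times 4 approximates pi.', '3.14')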

Test question 2: Which is bigger? 9.919 or 9.92?
Original model answer:

9.919 is smaller than 9.92.

】,【】,【】,【】,【】,【】,【】,...

(This time the original model answered correctly. The stray characters at the end are likely padding; because this is not an instruct model, the output format is a bit messy.)

The answer of the trained model:

<reasoning>

9.919 is equivalent to 9.91900, while 9.92 is equivalent to 9.92000. Since 9.92000 is greater than 9.91900, 9.92 is bigger.

</reasoning>

<answer>

9.92

</answer>

It can be seen that the trained model's answer fully follows the required format and includes a reasoning process. The answer is concise and accurate, a clear improvement over the earlier run of only 250 steps.