Tutorial on reproducing DeepSeek R1 Zero with a single card is here!

Written by
Clara Bennett
Updated on: July 16, 2025

DeepSeek R1 Zero single-card reproduction revealed: a combination of cost-effectiveness and performance!

Core content:
1. Feasibility analysis of reproducing DeepSeek R1 Zero on a single card
2. A detailed look at Unsloth + LoRA and how they optimize performance while reducing resource consumption
3. A step-by-step guide to environment setup and reproduction that is easy to follow



A reader asked me a while ago: "Host, your team's reproduction of R1 Zero is indeed impressive, but it still consumes too much compute. We don't have three A800s. Is there a more economical and simpler way to learn to reproduce R1 Zero?"
Yes, friend, yes there is, there are nine such ways (just kidding). Today we introduce an interesting method that lets you reproduce DeepSeek R1 Zero on a single card, even with just one RTX 4090!

Why can a single card reproduce?

You may ask: "Why did we need three A800s before, but only one card now? What's the trick?" The answer lies in the introduction of Unsloth + LoRA.
Unsloth's core advantages are: 
  • Reinforcement learning algorithm optimization: it integrates multiple reinforcement learning (RL) algorithms and significantly improves the performance of large models during inference and fine-tuning by optimizing the underlying code (for example, optimizing the computational graph and reducing redundant operations).
  • Up-to-date quantization techniques: these greatly reduce VRAM consumption, allowing large models that previously required multiple GPUs to run on a single card.
  • Full LoRA and QLoRA fine-tuning support: even with limited VRAM, R1 Zero can be reproduced with a modest amount of resources.
This gives us a cheaper and simpler implementation path. The official Unsloth blog mentions that only 7 GB of VRAM is needed to train the Qwen2.5-1.5B model.
Unsloth GitHub: https://github.com/unslothai/unsloth
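To get an intuition for why LoRA makes a single card sufficient, here is a rough back-of-the-envelope sketch. The layer shape below is an illustrative assumption (not the exact Qwen2.5-1.5B architecture); the point is simply that a rank-64 adapter trains only a small fraction of the parameters a full fine-tune would touch.

# Illustrative arithmetic only: trainable parameters for one weight matrix
# under full fine-tuning vs. a rank-64 LoRA adapter (W stays frozen; only A and B train).
d_in, d_out, rank = 1536, 1536, 64   # hidden size is an assumed value for illustration

full_params = d_in * d_out            # updating the whole matrix W
lora_params = rank * (d_in + d_out)   # updating only B (d_out x r) and A (r x d_in)

print(f"Full fine-tuning: {full_params:,} trainable params for this matrix")
print(f"LoRA (r=64):      {lora_params:,} trainable params "
      f"({lora_params / full_params:.1%} of full)")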

Environment Construction

Installing Unsloth

The environment setup was explained in detail in a previous article on this official account. Here you only need to install Unsloth and the specified version of the trl library on top of that setup.
DeepSeek R1 Zero Chinese Reproduction Tutorial is here!
Additional note: in the multi-card training code previously released on this official account, a "thinking length reward function" was mistakenly included, and flash-attn was not used in the code. The Unlock-DeepSeek team has fixed the code; please use the latest code in the repository. We have also updated the training chart.

This article only shows the code parts that are different from the previous one. We also provide the complete training code, which can be obtained at the end of the article. 

Note: to be compatible with Unsloth, we need to install a specific version of trl. The command is as follows:


# Install unsloth and vllm
pip install unsloth vllm

# Install the specified version of trl (compatible with unsloth)
pip install trl==0.15.0
Reference: https://docs.unsloth.ai/get-started/unsloth-notebooks
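Before launching a long run, a quick sanity check can confirm the pinned trl version and a working Unsloth import. This is just a minimal sketch using the packages installed above:

# Sanity-check the installation before training.
import unsloth  # noqa: F401  (importing it is enough to verify the install; import unsloth first)
import trl

print("trl version:", trl.__version__)  # expected: 0.15.0 for compatibility with Unsloth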

Configuration file modification

Most of the configuration is consistent with the previous Datawhale-R1.yaml file. To support reproducing R1 Zero on a single card, we made the following adjustments:

  • LoRA parameter settings: enable LoRA fine-tuning, set the LoRA rank (lora_r) to 64 (common choices are 8, 16, 32, 64, 128, etc.), and set lora_alpha to 32.
  • Limit answer length: set max_completion_length to 1024 to control the output length.
  • Optimizer adjustment: set the optimizer to adamw_8bit to speed up training.
Note: to save more memory, max_completion_length is set to 1024 here, but this may hurt model performance. If you have sufficient resources, setting it higher (4096, 8192) may give better results, at the cost of higher resource consumption. If memory is insufficient, you can lower vllm_gpu_memory_utilization appropriately. In addition, if you have more resources, consider switching the optimizer optim to adamw_torch, which helps reproduce the model more faithfully.

# LoRA parameter adjustment
lora_r: 64 # LoRA rank, choose any number greater than 0! 8, 16, 32, 64, 128 are recommended
lora_alpha: 32 # LoRA alpha value

# Training parameters
learning_rate: 1.0e-5 # Learning rate, adjusted to 1e-5

# GRPO algorithm parameters
beta: 0.001 # KL penalty factor
optim: adamw_8bit # Use 8-bit optimizer to speed up training
max_prompt_length: 256 # Maximum length of the input prompt
max_completion_length: 1024 # Output answer length, including the reasoning chain
num_generations: 4
use_vllm: true # Enable vLLM to accelerate inference
vllm_gpu_memory_utilization: 0.4 # vLLM GPU memory utilization (can be reduced when memory is tight)

LoRA fine-tuning reference: https://zhuanlan.zhihu.com/p/663557294
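For orientation, here is a minimal sketch (assuming PyYAML is installed and the file name used above) that loads the adjusted YAML and prints the memory-relevant fields; the actual training script maps these keys onto the ModelConfig / GRPOConfig dataclasses shown later.

# Minimal sketch: peek at the single-card YAML before launching training.
import yaml

with open("Datawhale-R1_unsloth.yaml") as f:
    cfg = yaml.safe_load(f)

print("LoRA rank / alpha:", cfg["lora_r"], "/", cfg["lora_alpha"])
print("Prompt + completion budget:",
      cfg["max_prompt_length"] + cfg["max_completion_length"], "tokens")
print("vLLM GPU memory fraction:", cfg["vllm_gpu_memory_utilization"])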

Start training

The command to start training is very simple. Since we only need a single card, there is no need to configure the Accelerate library; just run the following:

python train_Datawhale-R1_unsloth.py --config Datawhale-R1_unsloth.yaml

Training code optimization interpretation

Based on the Unsloth framework, we simplified and optimized the original code around two main ideas:

Patching to speed up training

Before executing the reinforcement learning training code, we added two lines that use the PatchFastRL function to "patch" certain RL algorithms (such as GRPO). Under the hood, this optimizes the computational graph and reduces redundant computation, thereby speeding up training.

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # Patch the GRPO algorithm

Improvements to GRPO training functions

In addition, we improved the grpo_function function with some optimizations, specifically in lines 14 to 34 of the code. Concretely, we added the following two steps:
  • Model loading: load the pre-trained model via the FastLanguageModel.from_pretrained method, enable vLLM fast inference, and support 4-bit loading (or LoRA 16-bit).
  • PEFT fine-tuning: use the get_peft_model method to apply LoRA fine-tuning to the model, specifying the target modules, LoRA parameters, and gradient checkpointing to keep training feasible under limited VRAM.
# Define GRPO training function
def grpo_function(
    model_args: ModelConfig,
    dataset_args: DatasetArguments,
    training_args: GRPOConfig,
    callbacks: List,
):
    # Record model parameters
    logger.info(f"Model parameters {model_args}")
    # Record training/evaluation parameters
    logger.info(f"Training/evaluation parameters {training_args}")

    # Load the model and tokenizer from the pre-trained model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_args.model_name_or_path,  # Model name or path
        fast_inference=True,  # Enable vLLM fast inference
        load_in_4bit=True,  # Whether to load the model in 4 bit; False means LoRA 16 bit
        max_lora_rank=model_args.lora_r,  # Set the maximum rank of LoRA
        max_seq_length=training_args.max_completion_length,  # Set the maximum sequence length
        gpu_memory_utilization=training_args.vllm_gpu_memory_utilization,  # GPU memory utilization; can be reduced if memory is insufficient
        attn_implementation=model_args.attn_implementation,  # Attention implementation, e.g. flash attention
    )

    # PEFT model
    model = FastLanguageModel.get_peft_model(
        model,
        r=model_args.lora_r,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",  # If OOM occurs, QKVO can be removed
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=model_args.lora_alpha,  # Set the alpha value of LoRA
        use_gradient_checkpointing="unsloth",  # Enable unsloth gradient checkpointing
        random_state=training_args.seed,  # Set the random seed
    )
If you encounter an Out of Memory error, you can remove "q_proj", "k_proj", "v_proj", "o_proj" from target_modules.
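The rest of grpo_function is unchanged from the multi-card version and is omitted here; the full script is available at the end of the article. Purely for orientation, the sketch below shows how such a LoRA-wrapped model is typically handed to trl's GRPOTrainer. The reward functions and dataset names are placeholders, not the exact definitions from our code.

# Minimal sketch (not the full training script): wiring the patched model into trl's GRPOTrainer.
# `format_reward`, `accuracy_reward`, and `train_dataset` are placeholders for the
# definitions in the complete training code.
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                                    # LoRA-wrapped model from get_peft_model
    processing_class=tokenizer,                     # tokenizer returned by from_pretrained
    reward_funcs=[format_reward, accuracy_reward],  # placeholder reward functions
    args=training_args,                             # GRPOConfig built from the YAML above
    train_dataset=train_dataset,                    # placeholder dataset
)
trainer.train()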

Reference: https://unsloth.ai/blog/r1-reasoning

Model quantization reference: LLM quantization comprehensive guide (8bits/4bits) https://zhuanlan.zhihu.com/p/671007819

Training results and some thoughts

The following are some screenshots of the training results, which are roughly similar to the results we reproduced for Tiny Zero and Mini R1, so we will not analyze them in detail here. 
Next, I will share some of my thoughts while learning R1 Zero (not a rigorous academic study, but a personal opinion, for reference only). 

Is the Aha moment a result of RL training?

Before I started digging in, I was curious about the concept of the Aha moment, which seemed like a superpower DeepSeek suddenly gained after RL training. But after reading oat's article in depth, I found that the Aha moment does not appear out of thin air; the seed may already be planted in the base model and the SFT stage. What RL training does is act more like an "amplifier": through the designed reward mechanism, it maximizes the probability of the model generating aha moments. In other words, RL training turns the model's existing shallow self-reflection ability into a deeper and more effective thinking process.
Reference OAT article: There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study: https://oatllm.notion.site/oat-zero 

Are longer thinking times more effective?

There is a common view in the community that RL training makes the model's output longer, thereby improving the effect. This view does make sense, because RL strengthens the model's thinking process, and generating more tokens is a natural result.
However, the question is: does longer thinking really mean better results? 
While reproducing Tiny Zero, I observed an interesting phenomenon: the number of tokens first decreases and then increases. One way to explain it: initially, because of the format reward, the model must ensure the output format is correct. An overly long output makes the answer format harder to learn and contains many tokens that are useless for solving the task, so the token count naturally drops first. The model first learns the simple format and keeps the tokens that help it compute correctly, and only then learns the more complex calculations, going from simple to complex. As training progresses, the model starts making more attempts and reflections to reach the correct answer, and the output length gradually increases and then stabilizes. This observation also matches OAT's conclusion: there is not necessarily a linear correlation between output length and the quality of self-reflection.
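If you want to watch this trend in your own run, one simple option (a sketch, not part of the released training code) is to log the average token count of the sampled completions at each step:

# Simple sketch: track the average completion length per training step.
# `tokenizer` is the one returned by FastLanguageModel.from_pretrained;
# `completions` stands for the list of sampled answers at a given step.
def mean_completion_tokens(completions, tokenizer):
    lengths = [len(tokenizer(c)["input_ids"]) for c in completions]
    return sum(lengths) / max(len(lengths), 1)

# Example usage inside a reward function or logging callback:
# logger.info(f"mean completion length: {mean_completion_tokens(completions, tokenizer):.1f} tokens")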

Some conclusions and thoughts on the s1 paper

Recently, I also read the s1 paper from Fei-Fei Li's team, which analyzes in detail the differences between their method and R1 Zero. In short, s1 is trained with a small amount of high-quality data (about 1k samples + SFT + designed prompts), while R1 Zero goes from a base model through RL training. In s1 they adopt "budget forcing" to impose a maximum and a minimum number of thinking tokens at test time (sketched below). Specifically:
  • To cap thinking, append the "end-of-thinking token delimiter" and "Final Answer" so the model stops thinking and answers;
  • To extend thinking, suppress generation of the delimiter and append the prompt word "Wait".
The experimental results show that moderately increasing the number of thinking tokens can indeed improve performance on the AIME24 benchmark. However, they also found that suppressing the end of thinking too aggressively causes the model to fall into useless loops. This finding is intuitive: just like human thinking, simple problems (such as counting the letters in a word) do not need extended deliberation, and the problems that truly benefit from longer thinking are usually the more complex ones.
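As a rough illustration of budget forcing (my own sketch, not s1's actual implementation), the idea can be written as: generate up to a thinking budget, strip any premature end-of-thinking delimiter and append "Wait" to force more reasoning, then close the thinking block and request the final answer. The generate callable and the "</think>" delimiter below are assumptions for illustration.

# Rough sketch of s1-style "budget forcing" at inference time (illustrative only).
# `generate(prompt, max_new_tokens, stop)` stands for any text-generation call;
# "</think>" stands for the end-of-thinking token delimiter.
def budget_forced_generate(generate, prompt, extra_rounds=1, max_think_tokens=1024):
    text = generate(prompt, max_new_tokens=max_think_tokens, stop=["</think>"])
    for _ in range(extra_rounds):
        # Enforce a minimum: suppress the end-of-thinking delimiter and append "Wait"
        # so the model keeps reasoning instead of answering immediately.
        text = text.replace("</think>", "") + "\nWait"
        text += generate(prompt + text, max_new_tokens=max_think_tokens, stop=["</think>"])
    # Enforce a maximum: close the thinking block and ask for the final answer.
    return generate(prompt + text + "</think>\nFinal Answer:", max_new_tokens=256)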
s1 further reading: trained on 16 H100s in 26 minutes and surpassing o1-preview! Fei-Fei Li's team uses 1K samples to reveal test-time scaling

Summary and Outlook

First of all, I would like to thank Unsloth again for these optimizations, and the community partners for their efforts. They not only make training and inference of large models more efficient, but also greatly reduce VRAM consumption, so that R1 Zero can be reproduced easily even with a single GPU. This provides a more economical and simpler reproduction path and opens up new possibilities for applying large models in low-resource environments.