The tutorial on reproducing DeepSeek R1 Zero on a single GPU is here!

The technique behind reproducing DeepSeek R1 Zero on a single GPU, revealed: cost-effectiveness and performance combined!
Core content:
1. Feasibility analysis of reproducing DeepSeek R1 Zero on a single GPU
2. A detailed explanation of Unsloth + LoRA and how they optimize performance and reduce resource consumption
3. A step-by-step guide to environment setup and reproduction that is easy to get started with
Why can R1 Zero be reproduced on a single GPU?
Reinforcement learning algorithm optimization: Unsloth integrates multiple reinforcement learning (RL) algorithms and significantly improves the performance of large models during inference and fine-tuning by optimizing the underlying code (for example, optimizing the computational graph and reducing redundant operations).
The latest quantization technology: greatly reduces GPU memory consumption, allowing large models that originally required multiple GPUs to run on a single one.
Full LoRA and QLoRA fine-tuning support: even with limited GPU memory, R1 Zero can be reproduced with a small amount of resources.
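To get a feel for the numbers, here is a rough back-of-envelope sketch of my own (assuming a Llama-7B-like model with hidden size 4096, MLP size 11008, and 32 layers; it ignores activations, the KV cache, and framework overhead) showing how 4-bit weights plus LoRA adapters shrink the memory footprint:
# Rough estimate only: why 4-bit weights + LoRA fit on one GPU
# (assumed Llama-7B-like shapes; ignores activations and KV cache)
n_params = 7e9  # assume a 7B-parameter base model

fp16_weights_gb = n_params * 2 / 1024**3    # 2 bytes per weight in fp16
int4_weights_gb = n_params * 0.5 / 1024**3  # 0.5 bytes per weight in 4-bit

# LoRA adds r * (d_in + d_out) trainable parameters per target weight matrix.
d, d_ffn, r = 4096, 11008, 64                 # hidden size, MLP size, LoRA rank
attn = 4 * r * (d + d)                        # q_proj, k_proj, v_proj, o_proj
mlp = 2 * r * (d + d_ffn) + r * (d_ffn + d)   # gate_proj, up_proj, down_proj
lora_params = 32 * (attn + mlp)               # 32 transformer layers

print(f"fp16 weights : {fp16_weights_gb:.1f} GB")
print(f"4-bit weights: {int4_weights_gb:.1f} GB")
print(f"LoRA trainable params: {lora_params / 1e6:.0f} M "
      f"(about {lora_params * 2 / 1024**3:.2f} GB in 16-bit)")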
Environment Construction
Installing Unsloth
This article only shows the code that differs from the previous article. We also provide the complete training code, which can be obtained at the end of the article.
Note: To be compatible with Unsloth, we need to install a specific version of trl. The specific command is as follows:
# Install unsloth and vllm
pip install unsloth vllm
# Install the specified version of trl (compatible with unsloth)
pip install trl==0.15.0
Reference: https://docs.unsloth.ai/get-started/unsloth-notebooks
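After installation, you can optionally sanity-check the installed versions. The snippet below is my own addition and uses the standard importlib.metadata API (not anything Unsloth-specific); trl should report 0.15.0:
# Optional sanity check (not part of the original tutorial)
from importlib.metadata import version

for pkg in ("unsloth", "vllm", "trl"):
    print(pkg, version(pkg))  # trl should print 0.15.0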
Configuration file modification
Most of the configuration is consistent with the previous Datawhale-R1.yaml file. To support reproducing R1 Zero on a single GPU, we made the following adjustments:
LoRA parameter settings: enable LoRA fine-tuning, set the LoRA rank (lora_r) to 64 (common choices are 8, 16, 32, 64, 128, etc.), and set lora_alpha to 32 (a small sketch of the LoRA update rule follows the reference below).
Limit answer length: set max_completion_length to 1024 to control the output length.
Optimizer adjustment: set the optimizer to adamw_8bit to speed up training.
# LoRA parameter adjustment
lora_r: 64 # LoRA rank, choose any number greater than 0! 8, 16, 32, 64, 128 are recommended
lora_alpha: 32 # LoRA alpha value
# Training parameters
learning_rate: 1.0e-5 # Learning rate, adjusted to 1e-5
# GRPO algorithm parameters
beta: 0.001 # KL penalty factor
optim: adamw_8bit # Use the 8-bit optimizer to speed up training
max_prompt_length: 256 # Maximum length of the input prompt
max_completion_length: 1024 # Output answer length, including the reasoning chain
num_generations: 4 # Number of completions sampled per prompt in GRPO
use_vllm: true # Enable vLLM to accelerate inference
vllm_gpu_memory_utilization: 0.4 # vLLM GPU memory utilization (can be reduced when memory is tight)
LoRA fine-tuning reference: https://zhuanlan.zhihu.com/p/663557294
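As a quick reminder of what lora_r and lora_alpha control, here is a minimal torch sketch of my own of the LoRA update W' = W + (alpha / r) * B A applied to a single linear layer (illustrative only, not the article's code):
import torch

d_out, d_in, r, alpha = 4096, 4096, 64, 32  # matches lora_r: 64 and lora_alpha: 32 above

W = torch.randn(d_out, d_in)       # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01    # trainable low-rank factor A
B = torch.zeros(d_out, r)          # trainable low-rank factor B (zero-initialized)

x = torch.randn(d_in)
y = W @ x + (alpha / r) * (B @ (A @ x))  # LoRA forward pass, update scaled by alpha / r
print(y.shape)  # torch.Size([4096])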
Start training
The command to start training is very simple. Since we only need a single GPU, there is no need to configure the Accelerate library; just run the following:
python train_Datawhale-R1_unsloth.py --config Datawhale-R1_unsloth.yaml
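For reference, the YAML file is mapped onto the script's configuration dataclasses via trl's TrlParser. The sketch below is assumed from the common trl pattern rather than taken from the article's script (which also defines its own DatasetArguments dataclass):
# Sketch of how --config is parsed (assumed, based on the standard trl pattern)
from trl import TrlParser, ModelConfig, GRPOConfig

if __name__ == "__main__":
    parser = TrlParser((ModelConfig, GRPOConfig))
    # --config Datawhale-R1_unsloth.yaml is read here and fills both dataclasses
    model_args, training_args = parser.parse_args_and_config()
    print(model_args.model_name_or_path, training_args.learning_rate)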
Training code optimization interpretation
Based on the Unsloth framework, we simplified and optimized the original code. There are two main ideas:
Patching to speed up training
Before running the reinforcement learning training code, we add two lines that call the PatchFastRL function to "patch" certain RL algorithms (such as GRPO). Under the hood, this optimizes the computational graph and removes redundant computation, thereby speeding up training.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # Patch the GRPO algorithm
Improvements to GRPO training functions
Model loading: load the pre-trained model with FastLanguageModel.from_pretrained, enabling vLLM fast inference and 4-bit loading (or LoRA 16-bit).
PEFT fine-tuning: apply LoRA to the model with get_peft_model, specifying the target modules, LoRA parameters, and gradient checkpointing so that training remains effective under limited GPU memory.
# Define the GRPO training function
def grpo_function(
    model_args: ModelConfig,
    dataset_args: DatasetArguments,
    training_args: GRPOConfig,
    callbacks: List,
):
    # Log model parameters
    logger.info(f"Model parameters {model_args}")
    # Log training/evaluation parameters
    logger.info(f"Training/evaluation parameters {training_args}")

    # Load the model and tokenizer from the pre-trained checkpoint
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_args.model_name_or_path,  # model name or path
        fast_inference=True,  # enable vLLM fast inference
        load_in_4bit=True,  # load the model in 4-bit; False means LoRA 16-bit
        max_lora_rank=model_args.lora_r,  # maximum LoRA rank
        max_seq_length=training_args.max_completion_length,  # maximum sequence length
        gpu_memory_utilization=training_args.vllm_gpu_memory_utilization,  # GPU memory utilization; reduce if memory is insufficient
        attn_implementation=model_args.attn_implementation,  # attention implementation, e.g. flash attention
    )

    # Wrap the model with PEFT (LoRA)
    model = FastLanguageModel.get_peft_model(
        model,
        r=model_args.lora_r,  # LoRA rank
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",  # remove QKVO if you run out of memory (OOM)
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=model_args.lora_alpha,  # LoRA alpha value
        use_gradient_checkpointing="unsloth",  # enable Unsloth gradient checkpointing
        random_state=training_args.seed,  # set the random seed
    )
Reference: https://unsloth.ai/blog/r1-reasoning
Model quantization reference: LLM quantization comprehensive guide (8bits/4bits) https://zhuanlan.zhihu.com/p/671007819
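The rest of grpo_function is omitted here; the complete training code is available at the end of the article. For orientation, the remainder typically wires the reward functions into trl's GRPOTrainer and starts training. The sketch below is my own and uses placeholder names (reward_functions, train_dataset), not necessarily those of the actual script:
    # Sketch only: a typical continuation of grpo_function (placeholder names)
    from trl import GRPOTrainer  # normally imported at the top of the script

    trainer = GRPOTrainer(
        model=model,                    # the LoRA-wrapped Unsloth model from above
        reward_funcs=reward_functions,  # list of reward functions (e.g. format and answer rewards)
        args=training_args,             # the GRPOConfig parsed from the YAML file
        train_dataset=train_dataset,    # dataset prepared elsewhere in the script
        processing_class=tokenizer,     # tokenizer returned by from_pretrained
        callbacks=callbacks,
    )
    trainer.train()                               # run GRPO training
    trainer.save_model(training_args.output_dir)  # save the trained LoRA adapter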
Training results and some thoughts
Is the Aha moment a result of RL training?
Are longer thinking times more effective?
Some conclusions and thoughts from the s1 paper
Cap the thinking length by appending the "end-of-thinking token delimiter" and "Final Answer"; extend the thinking length by suppressing generation of the delimiter and appending the word "Wait".
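A toy sketch of my own of this budget-forcing idea (assuming "</think>" as the end-of-thinking delimiter and approximating tokens by whitespace splitting; this is not the s1 implementation):
# Toy illustration of budget forcing (assumptions: "</think>" delimiter, whitespace tokens)
END_THINK = "</think>"

def cap_thinking(reasoning: str, budget: int) -> str:
    """Stop thinking early: truncate to the budget, then append the
    end-of-thinking delimiter and 'Final Answer:'."""
    tokens = reasoning.split()
    if len(tokens) > budget:
        reasoning = " ".join(tokens[:budget])
    return reasoning + f" {END_THINK}\nFinal Answer:"

def extend_thinking(reasoning: str) -> str:
    """Keep thinking: suppress the end-of-thinking delimiter and append
    'Wait' so that generation continues."""
    return reasoning.replace(END_THINK, "").rstrip() + " Wait,"

draft = "First add 2 and 3 to get 5, then double it to get 10 " + END_THINK
print(cap_thinking(draft, budget=6))
print(extend_thinking(draft))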