Nearly 40K GitHub Stars! A low-cost open-source solution for fine-tuning DeepSeek is quietly gaining popularity

A secret weapon for building high-quality private models at low cost and improving business competitiveness and value.
Core content:
1. Colossal-AI's open-source large-model post-training toolbox enables low-cost LoRA fine-tuning of the full DeepSeek V3/R1 671B
2. Support for multiple hardware backends, plus mixed-precision training, gradient checkpointing, and other acceleration techniques
3. Flexible parallel-strategy configuration interfaces that adapt to different hardware scales for fast, low-cost fine-tuning
DeepSeek V3/R1 has taken the Internet by storm, and solutions and API services built on the original model are everywhere, most of them stuck in a race to the bottom of low prices and free offerings.
How can we stand on the shoulders of giants and use post-training with domain-specific data to build high-quality private models at low cost, improving business competitiveness and value? Colossal-AI, which has earned nearly 40,000 GitHub Stars, has released an open-source large-model post-training toolbox that includes:
- Low-cost SFT fine-tuning of the full 671B DeepSeek V3/R1 with LoRA
- A complete reinforcement-learning toolchain: PPO, GRPO, DPO, SimPO, etc.
- Seamless adaptation to HuggingFace open-source models, including the DeepSeek-series distilled models
- Compatibility with NVIDIA GPUs, Huawei Ascend NPUs, and other hardware
- Mixed-precision training, gradient checkpointing, and other training accelerations to reduce cost
- Flexible training configuration interfaces, supporting custom reward functions, loss functions, etc.
- A flexible parallel-strategy configuration interface, covering data parallelism, model parallelism, expert parallelism, ZeRO, and offloading, to adapt to different hardware scales
Open source address:
https://github.com/hpcaitech/ColossalAI
Low-cost supervised fine-tuning of the full DeepSeek V3/R1 671B
The full DeepSeek V3/R1 has as many as 671 billion parameters. How can it be fine-tuned at low cost? Only the following steps are needed.
Dataset preparation
The training script accepts a JSONL file as the input dataset, for example https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/lora_sft_data.jsonl. Each line of the dataset should be a single conversation, represented as a list of chat messages. For example:
[{"role": "user", "content": "Hi, how are you doing?"}, {"role": "assistant", "content": "I'm fine. How can I help you today?"}]
[{"role": "user", "content": "Why didn't Cao Cao call 119 for help during the Battle of Red Cliff?"},{"role": "assistant", "content": "Because there were no telephones or modern fire fighting systems during the Three Kingdoms period, Cao Cao could not call 119 for help."}]
This data format is compatible with the HuggingFace chat template and supports custom system prompts, so it can be configured flexibly as needed.
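As a quick sanity check before training, you can render a few lines of the dataset through a HuggingFace chat template. The snippet below is only a minimal sketch; the tokenizer name and file path are placeholders rather than part of the Colossal-AI script:

import json
from transformers import AutoTokenizer

# Placeholder: use the tokenizer that matches the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

with open("path-to-dataset.jsonl") as f:
    for line in f:
        # Each line is one conversation: a list of {"role", "content"} messages.
        messages = json.loads(line)
        print(tokenizer.apply_chat_template(messages, tokenize=False))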
Model weight preparation
To ensure better fine-tuning results, BF16 weights are used for fine-tuning.
If you have downloaded the FP8 DeepSeek V3/R1 weights, you can use the official DeepSeek script https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py to convert them to BF16 on GPU.
If you are using Huawei Ascend compute, you can instead use the script at https://gitee.com/ascend/ModelZoo-PyTorch/blob/master/MindIE/LLM/DeepSeek/DeepSeek-V2/NPU_inference/fp8_cast_bf16.py to convert the weights.
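For example, after downloading the official conversion script, it can typically be invoked as shown below. The --input-fp8-hf-path and --output-bf16-hf-path argument names are taken from the DeepSeek-V3 repository and should be checked against the script version you download; the paths are placeholders:
python fp8_cast_bf16.py --input-fp8-hf-path path-to-DeepSeek-R1-fp8 --output-bf16-hf-path path-to-DeepSeek-R1-bf16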
How to use
After preparing the dataset and model weights, you can launch fine-tuning with the one-click startup script provided by Colossal-AI: https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/lora_finetune.py
This script is similar to common SFT scripts and is fully compatible with HuggingFace PEFT. Launch command:
colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora
For more details on each parameter, run python lora_finetune.py --help. The script logs the learning rate, loss, and gradient norm to TensorBoard, making it easy to monitor training.
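The --hostfile argument points to a plain-text file listing the nodes to launch on, typically one hostname or IP address per line. A 3-node hostfile (with placeholder addresses) might look like:
192.168.0.10
192.168.0.11
192.168.0.12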
Optimizing hardware resource consumption using LoRA
With optimizations such as LoRA, the example command reduces the minimum hardware requirement for SFT of DeepSeek V3/R1 671B by nearly 10x, to 32 Ascend 910B 64GB NPUs (with ep=8, pp=4) or 24 H100/H800 GPUs (with ep=8, pp=3). Enabling CPU offload via --zero_cpu_offload lowers the hardware requirement further, at the cost of some training speed.
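For reference, a launch command for the 32-NPU Ascend configuration mentioned above would mirror the earlier example, changing only the pipeline-parallel degree and optionally enabling offload. This is a sketch derived from the flags already shown, not a separately verified configuration:
colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 4 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --zero_cpu_offload --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora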
As shown in the figure below, the loss decreases smoothly during SFT of DeepSeek V3/R1 671B.
Development teams with sufficient budget can also use the same scripts to efficiently scale parallelism to hundreds or thousands of cards and quickly complete full-parameter fine-tuning of DeepSeek V3/R1 671B or further parallel acceleration.
For those on a limited budget who want to build their own DeepSeek-R1-style models with reinforcement learning, Colossal-AI also provides a solution, with the algorithm verified on a small model.
Fine-tuning the distilled version of DeepSeek via reinforcement learning
The Colossal-AI team implemented and verified the GRPO algorithm with verifiable rewards from the DeepSeek paper, and ran experiments on the Qwen2.5-3B-Base model. The reward is designed as follows:
- reward = 0, if the format is wrong;
- reward = 1, if the format is correct but the answer is wrong;
- reward = 10, if both the format and the answer are correct.
Using the Qwen2.5-3B-Base model as an example, the Colossal-AI team provides a conversation template and settings for verifying GRPO ( https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/conversation_template/Qwen_Qwen2.5-3B.json ). Training can be started with one click by configuring the following bash file:
https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_grpo.sh
In the GRPO section, the Colossal-AI team also shares findings from the verification process and a detailed description of the various parameters for reference.
The code also includes a template for flexibly configuring reward functions, so users can design their own reward scheme for their specific scenario.
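A minimal sketch of such a reward function, implementing the 0/1/10 scheme described above, is shown below. The function name and the format/answer checks are illustrative assumptions for a math-style task, not the actual Colossal-AI interface:

import re

def compute_reward(response: str, gold_answer: str) -> float:
    # Assumed format: chain-of-thought inside <think>...</think>, final result in \boxed{}.
    if re.search(r"<think>.*</think>", response, re.DOTALL) is None:
        return 0.0  # wrong format
    boxed = re.search(r"\\boxed\{(.+?)\}", response)
    answer_ok = boxed is not None and boxed.group(1).strip() == gold_answer.strip()
    return 10.0 if answer_ok else 1.0  # correct format; 10 only if the answer is also correct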
As can be seen from the figure below, even for a 3B model, the average reward and the model's response length gradually increase during training.
As training progresses, some interesting examples emerge; for instance, over successive iterations the model starts to correct itself:
Colossal-AI: The best post-training toolbox
Building on its cost reductions and efficiency gains for large-model pre-training, Colossal-AI is committed to further becoming the best out-of-the-box post-training toolbox for developers, helping users quickly build private models on top of open-source models at low cost.