A step-by-step guide to fine-tuning the DeepSeek large model for personalized needs

Master DeepSeek large model fine-tuning and unleash the potential of AI.
Core content:
1. Why fine-tuning DeepSeek LLM is necessary, and the basic methods
2. Supervised fine-tuning (SFT) explained in detail, with hands-on code
3. Key points: the loss function, data subset selection, and LoRA
1 Introduction
DeepSeek LLM is a powerful model, but fine-tuning is essential to get the most out of it in specific scenarios. This article explains in detail how to fine-tune it on a Hugging Face dataset with supervised fine-tuning (SFT), walks through the code step by step, and discusses key points such as the loss function, data subsets, and Low-Rank Adaptation (LoRA).
For hands-on practice, you can use the Google Colab platform: colab.research.google.com.
2 Overview of Supervised Fine-tuning (SFT)
Supervised fine-tuning (SFT) is the process of further training a pre-trained model on a labeled dataset so that it can be specialized for a specific task, such as customer support, medical question answering, or e-commerce recommendations.
2.1 Fine-tuning principle
Fine-tuning adapts a pre-trained model to a specific task using labeled data, where:
Input (X): the text data provided to the model.
Target (Y): the expected output for that input (e.g., a sentiment label, a chatbot response, or a summary).
Loss function: measures how well the model's predictions match the expected outputs. For text generation, the most commonly used loss function is the cross-entropy loss.
For example, when fine-tuning on the IMDB sentiment dataset:
Input (X): a movie review such as "This movie has great visuals, but a weak plot."
Target (Y): the correct label, such as "positive" or "negative" sentiment.
For text generation tasks, the input can be a question and the target is the correct response generated by the model.
2.2 Cross Entropy Loss: A “Calibrator” for Fine-tuning Language Models
When fine-tuning a language model, the cross-entropy loss measures the difference between the token distribution predicted by the model and the actual target distribution. For a target sequence of $T$ tokens it can be written as:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta\left(y_t \mid y_{<t}, x\right)$$

where $y_t$ is the target token at step $t$, $y_{<t}$ are the preceding target tokens, and $x$ is the input. The goal of training is to minimize this loss, bringing the model's predictions closer to the targets, thereby generating more accurate text output and improving performance across text tasks.
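As a minimal sketch of this computation in PyTorch (toy shapes and token ids, not real model output):

import torch
import torch.nn.functional as F

# Toy example: 1 sequence of 4 target tokens over a 10-token vocabulary
logits = torch.randn(1, 4, 10)           # model predictions: (batch, seq_len, vocab)
targets = torch.tensor([[2, 5, 1, 7]])   # ground-truth token ids: (batch, seq_len)

# cross_entropy expects (N, vocab) vs (N,), so flatten batch and sequence dims
loss = F.cross_entropy(logits.view(-1, 10), targets.view(-1))
print(f"Cross-entropy loss: {loss.item():.4f}")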
3 Reasons for choosing data subsets
When fine-tuning a large language model like DeepSeek LLM on resource-constrained hardware, training with a full dataset (e.g., the IMDB dataset containing 25,000 samples) can result in long training times and insufficient GPU memory.
To alleviate these issues, we:
Select a data subset: use 500 samples for training and 100 for evaluation, reducing the data volume and the load on the hardware.
Ensure representativeness: the subset retains diverse features, so model performance is preserved.
A small dataset speeds up experimentation and is enough to demonstrate the concept of fine-tuning. In a production environment, however, you should use a larger dataset on more powerful infrastructure to get better model quality.
4 Loading DeepSeek LLM
Before fine-tuning, DeepSeek LLM needs to be loaded and prepared for training.
4.1 Install required libraries
First, install the necessary dependencies:
pip install -U torch transformers datasets accelerate peft bitsandbytes
4.2 Loading the model with 4-bit quantization
We use 4-bit quantization to enable large models to run with limited GPU memory:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/deepseek-llm-7b-base"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16  # use float16 to speed up computation
)

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some causal-LM tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA for memory-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                  # low-rank adaptation size
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # apply LoRA to the attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"                 # tell PEFT this is a causal language model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("✅ DeepSeek LLM loaded with LoRA in 4-bit precision!")
5 Training with the Hugging Face dataset
Fine-tuning requires a high-quality dataset. Hugging Face provides access to a variety of datasets:
5.1 Selecting a Dataset
In this example, we fine-tune DeepSeek LLM for sentiment classification using the IMDB dataset:
from datasets import load_dataset
# Load the dataset
dataset = load_dataset( "imdb" )
5.2 Preprocessing the Dataset
Convert the text into tokenized input acceptable to the model:
def tokenize_function(examples):
    inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
    # For causal-LM fine-tuning, the labels are the input ids themselves
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# To speed up the experiment, take subsets of the dataset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

# Print a tokenized sample entry
print("Tokenized sample:")
print(small_train_dataset[0])
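A note on batching: because every sequence is padded to max_length and the labels are copied from input_ids, the Trainer's default collation works here. As an optional alternative (not part of the original recipe), Hugging Face's DataCollatorForLanguageModeling with mlm=False pads dynamically per batch and builds causal-LM labels itself:

from transformers import DataCollatorForLanguageModeling

# Optional: dynamic per-batch padding; this collator also creates causal-LM
# labels automatically, making the manual labels copy above unnecessary.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# If you use it, pass data_collator=data_collator to the Trainer in section 7.2.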
6 LoRA (Low-Rank Adaptation): A Memory-Saving Tool for Large Model Fine-tuning
Memory usage is a major constraint when fine-tuning large language models, and LoRA (Low-Rank Adaptation) comes to the rescue with two key moves:
Freeze most of the model's weights, leaving them untouched during fine-tuning;
Introduce trainable low-rank matrices in key layers (such as the attention layers) to adapt the model precisely.
This dramatically reduces the number of trainable parameters without hurting model performance. With LoRA, fine-tuning large models on resource-constrained hardware such as a Colab GPU becomes practical, opening up more possibilities for developers.
How LoRA works
1) Decompose the weight update into low-rank matrices.
2) Apply updates only through the factorized matrices (e.g., the attention projections).
3) Reduce memory and compute costs compared with full fine-tuning (see the sketch below).
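To make the idea concrete, here is a minimal sketch of the factorization, using a hypothetical 4096-dimensional weight and rank 8 (illustrative numbers, not DeepSeek's actual layer sizes):

import torch

d, r = 4096, 8         # hypothetical hidden size and LoRA rank
W = torch.randn(d, d)  # frozen pretrained weight (receives no gradients)
A = torch.randn(r, d)  # trainable low-rank factor
B = torch.zeros(d, r)  # trainable; starts at zero so the update is a no-op initially

# Effective weight during fine-tuning: only A and B are updated
W_effective = W + B @ A

full_params = d * d    # ~16.8M parameters for one 4096x4096 matrix
lora_params = 2 * d * r  # ~65.5K parameters at rank 8
print(f"Trainable fraction: {lora_params / full_params:.4%}")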
7 Code Explanation: Fine-tuning DeepSeek LLM
7.1 Setting training parameters
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=3e-4,              # LoRA fine-tuning typically tolerates a higher learning rate than full fine-tuning
    per_device_train_batch_size=1,   # small batch size for memory efficiency
    gradient_accumulation_steps=8,   # simulate a larger batch size
    num_train_epochs=0.5,            # short run for this demo
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=50,
    fp16=True                        # mixed-precision training
)
7.2 Initializing the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
)
print("Trainer initialized!")
7.3 Start fine-tuning
print( "Start fine-tuning..." )
trainer.train()
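After training finishes, you can check the loss on the held-out subset; perplexity (the exponential of the loss) is a common way to read it. A short sketch:

import math

metrics = trainer.evaluate()
print(f"Eval loss: {metrics['eval_loss']:.4f}")
print(f"Perplexity: {math.exp(metrics['eval_loss']):.2f}")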
7.4 Save the fine-tuned model
trainer.save_model("./fine_tuned_deepseek")
tokenizer.save_pretrained("./fine_tuned_deepseek")
print("Fine-tuned model saved successfully!")
8 The Road Ahead for Large Model Training and Optimization
Move to production-scale training: the experiments above used a data subset for proof of concept and basic fine-tuning, but strong generalization and production-grade performance require a much larger dataset. Take an intelligent customer-service model as an example: small-scale data cannot cover the diversity of user questions, whereas a large corpus of real interaction records helps the model learn far more and meet high-concurrency, multi-scenario demands.
Explore advanced LoRA configurations: LoRA's advantages are clear, but there is still untapped potential. Future work can search over combinations of low-rank matrix dimensions to find the best trade-off between cost and quality, and combine LoRA with other optimization techniques, such as learning-rate tuning, for faster convergence, laying the groundwork for fine-tuning complex tasks and larger models. A starting-point configuration is sketched below.
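As a starting point for such experiments, a heavier (hypothetical, untested) configuration might raise the rank and cover more projection layers, trading extra trainable parameters for capacity:

from peft import LoraConfig

# Hypothetical configuration for experimentation, not a tested recipe
lora_config_large = LoraConfig(
    r=32,                                                     # higher rank captures more task-specific signal
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # all attention projections
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)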