Meeting Personalized Needs: A Step-by-Step Guide to Fine-Tuning the DeepSeek Large Model

Written by Audrey Miles
Updated on: July 16, 2025

Master DeepSeek large model fine-tuning and unleash the potential of AI.

Core content:
1. Why fine-tuning DeepSeek LLM is necessary, and the basic methods
2. A detailed explanation of supervised fine-tuning (SFT), with hands-on code
3. Key points: the loss function, data subset selection, and LoRA


A complete guide to DeepSeek LLM fine-tuning.

1 Introduction

DeepSeek LLM is a powerful model, but fine-tuning is essential to get the best results in specific scenarios. This article explains in detail how to fine-tune it on a Hugging Face dataset using supervised fine-tuning (SFT), walks through the code step by step, and discusses key points such as the loss function, data subsets, and low-rank adaptation (LoRA).

For hands-on practice, you can use the Google Colab platform: colab.research.google.com.

2 Overview of Supervised Fine-tuning (SFT)

Supervised fine-tuning (SFT) is the process of further training a pre-trained model on a labeled dataset so that it can be specialized for a specific task, such as customer support, medical question answering, or e-commerce recommendations.

2.1 Fine-tuning principle

Fine-tuning trains a pre-trained model on labeled data for a specific task, where:

  • Input (X): the text data provided to the model.
  • Target (Y): the expected output for the labeled data (e.g., a sentiment label, a chatbot response, or a summary).
  • Loss function: measures how well the model's predictions match the expected output. The most commonly used loss function for text generation is the cross-entropy loss.

For example, when fine-tuning on the IMDB sentiment dataset:

  • Input (X) : Movie reviews like "This movie has great visuals, but a weak plot."
  • Target (Y) : The correct label, such as “positive” or “negative” sentiment.

For text generation tasks, the input can be a question and the target is the correct response generated by the model.
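
Concretely, a single supervised training pair can be thought of as a simple record. The sketch below is purely illustrative; the field names are hypothetical, not a required format:

# Sentiment classification: input text and a class label
classification_pair = {
    "input":  "This movie has great visuals, but a weak plot.",
    "target": "negative",
}

# Text generation: input question and the desired response
generation_pair = {
    "input":  "How do I reset my password?",
    "target": "Go to Settings, choose 'Account', and click 'Reset password'.",
}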

2.2 Cross Entropy Loss: A “Calibrator” for Fine-tuning Language Models

When fine-tuning a language model, the cross-entropy loss measures the difference between the token distribution predicted by the model and the actual target distribution:

L = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(y_t \mid y_{<t}, x)

where N is the number of target tokens, y_t is the correct token at position t, and P_\theta(y_t \mid y_{<t}, x) is the probability the model assigns to that token given the input x and the preceding tokens.

The goal of training is to minimize this loss, bringing the model's predictions closer to the targets so that it generates more accurate text and performs better across text tasks.
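
As a minimal illustration (a toy sketch, not part of this tutorial's pipeline), this is how the per-token cross-entropy can be computed in PyTorch; the logits and labels below are made-up values:

import torch
import torch.nn.functional as F

vocab_size = 5
logits = torch.randn(1, 4, vocab_size)  # model scores for 1 sequence of 4 tokens
labels = torch.tensor([[2, 0, 4, 1]])   # correct token ids at each position

# F.cross_entropy expects (N, vocab) vs (N,), so flatten batch and sequence dims
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss.item())  # average negative log-likelihood of the correct tokens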

3 Reasons for choosing a data subset

When fine-tuning a large language model like DeepSeek LLM on resource-constrained hardware, training with a full dataset (e.g., the IMDB dataset containing 25,000 samples) can result in long training times and insufficient GPU memory.

To alleviate these issues, we:

  • Select a data subset: use 500 samples for training and 100 for evaluation, cutting the data volume and easing the hardware load.
  • Ensure representativeness: the subset should retain the dataset's diversity so that model performance is preserved.

A small dataset speeds up experimentation and is enough to demonstrate the concept of fine-tuning. In a production environment, however, you should train on a larger dataset and more powerful infrastructure to get the best performance.

4 Loading DeepSeek LLM

Before fine-tuning, DeepSeek LLM needs to be loaded and prepared for training.

4.1 Install required libraries

First, install the necessary dependencies:

pip install -U torch transformers datasets accelerate peft bitsandbytes

4.2 Loading the model with 4-bit quantization

We use 4-bit quantization to enable large models to run with limited GPU memory:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/deepseek-llm-7b-base"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16  # use float16 to speed up computation
)

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA for memory-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adaptation matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # apply LoRA to the attention projections
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
print("✅ DeepSeek LLM loaded with LoRA and 4-bit precision!")

5 Training with the Hugging Face dataset

Fine-tuning requires a high-quality dataset. Hugging Face provides access to a variety of datasets:

5.1 Selecting a Dataset

In this example, we fine-tune DeepSeek LLM for sentiment classification using the IMDB dataset:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("imdb")

5.2 Preprocessing the Dataset

Convert the text into tokenized input acceptable to the model:

def tokenize_function(examples):
    inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
    # For causal language modeling, the labels are the input tokens themselves
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# To speed up the experiment, take subsets of the dataset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

# Print a sample entry after tokenization
print("Sample after tokenization:")
print(small_train_dataset[0])

6 LoRA (Low-Rank Adaptation): A Memory-Saving Tool for Fine-Tuning Large Models

When fine-tuning large language models, memory usage is a major constraint, and LoRA (low-rank adaptation) comes to the rescue. LoRA relies on two key ideas:

  • Freeze most of the model's weights, keeping them fixed during fine-tuning;

  • Introduce small, trainable low-rank matrices in key layers (such as the attention layers) to adapt the model precisely where it matters.

This significantly reduces the number of trainable parameters without hurting model performance. With LoRA, fine-tuning a large model on resource-constrained hardware such as a Colab GPU becomes practical, opening up more possibilities for developers.

How LoRA works

1) Decompose the weight update into the product of two low-rank matrices.

2) Train only these factorized matrices (applied, for example, to the attention projections), leaving the original weights frozen.

3) Reduce memory and compute costs substantially compared with full fine-tuning, as the sketch below illustrates.
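
Conceptually, LoRA keeps a pre-trained weight matrix W frozen and learns the update ΔW = B·A, where B and A are two small matrices of rank r. A minimal PyTorch sketch of the idea (illustrative only, not how the peft library implements it internally):

import torch

d, r, alpha = 4096, 8, 32  # hidden size, LoRA rank (r << d), scaling factor
W = torch.randn(d, d)      # pre-trained weight, kept frozen
A = torch.nn.Parameter(torch.randn(r, d) * 0.01)  # trainable down-projection
B = torch.nn.Parameter(torch.zeros(d, r))         # trainable up-projection, starts at zero

def lora_forward(x):
    # Frozen path plus the scaled low-rank update: x @ (W + (alpha/r) * B @ A).T
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(1, d)
print(lora_forward(x).shape)  # torch.Size([1, 4096])
# Trainable parameters: 2*d*r = 65,536, versus d*d ≈ 16.8M for full fine-tuning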

7 Code Explanation: Fine-tuning DeepSeek LLM

7.1 Setting training parameters

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=3e-4,             # a typical learning rate for LoRA fine-tuning
    per_device_train_batch_size=1,  # small batch size for memory efficiency
    gradient_accumulation_steps=8,  # simulate a larger batch size
    num_train_epochs=0.5,           # half an epoch keeps the demo fast
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=50,
    fp16=True                       # mixed-precision training
)

7.2 Initializing the Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
)
print("Trainer initialized!")

7.3 Start fine-tuning

print( "Start fine-tuning..." )
trainer.train()

7.4 Save the fine-tuned model

trainer.save_model("./fine_tuned_deepseek")
tokenizer.save_pretrained("./fine_tuned_deepseek")
print("Fine-tuned model saved successfully!")

8 Advanced Road to Large Model Training and Optimization

  • Conduct production-level training: the experiments above used a data subset for proof of concept and basic fine-tuning, but a production model needs a much larger dataset to generalize well and perform reliably. Take a customer-support model as an example: small-scale data cannot cover the diversity of user questions, while a large corpus of real interaction records helps the model learn more and handle high concurrency and varied scenarios.

  • Explore advanced LoRA configurations: LoRA already delivers clear benefits, but there is room to go further. Future work can study different low-rank matrix dimensions to find the best trade-off between cost and quality, and combine LoRA with other optimizations, such as learning-rate tuning, to speed up convergence, laying the groundwork for fine-tuning complex tasks and larger models.
