Getting started from scratch: the DeepSeek fine-tuning tutorial is here!

In-depth yet easy to understand, so you can master DeepSeek fine-tuning with ease!
Core content:
1. Comparison of results before and after fine-tuning
2. The concept of large-model fine-tuning, explained through analogies
3. Real-life examples to help you understand the fine-tuning process
1. What is Large Model Fine-tuning?
Story explanation:
Life case explanation:
The basic smart speaker only speaks Mandarin (pre-trained model)
Let it listen to 100 sentences of Sichuan dialect (fine-tuning data)
Now it can understand "bai long men zhen", Sichuanese for "having a chat" (dialect comprehension ↑)
The original camera shoots every kind of scene (general-purpose model)
Load the "food filter" parameters (fine-tuned model)
Saturation is automatically boosted when photographing food (specialized enhancement)
Enhanced explanation: a Lego castle converted into a children's hospital
Step 1: Original structure - a generic Lego castle
Step 2: Partial modification - low-cost changes
Step 3: New function - it becomes a children's hospital
2. Hardware configurations tried so far
3. The fine-tuning work
(1) Dataset preparation
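The dataset itself is not reproduced here. The preprocessing code below expects a JSON file in which every record carries Question, Complex_CoT, and Response fields (a medical Q&A dataset, as the later explanation notes). A hypothetical record, with made-up content purely for illustration, could be generated like this:
import json

# Illustrative record only; the real dataset's fields match, but its contents will differ.
sample = {
    "Question": "What should I do if I have a fever?",
    "Complex_CoT": "It may be caused by a cold; check the temperature and other symptoms first.",
    "Response": "Drink more water and rest; see a doctor if the fever persists.",
}

with open("sample_dataset.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")    # one JSON object per line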
(2) Model fine-tuning code (written by hand, without a dedicated fine-tuning framework): the full code comes first, and a detailed explanation follows
Required libraries: pip install torch transformers peft datasets matplotlib accelerate safetensors
import torch
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TrainerCallback
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import os

# Configure the paths (modify according to your actual paths)
model_path = r"your model path"                     # model path
data_path = r"your dataset path"                    # dataset path
output_path = r"your fine-tuned model save path"    # save path for the fine-tuned model

# Force the use of a GPU
assert torch.cuda.is_available(), "Must use GPU for training!"
device = torch.device("cuda")

# Custom callback to record the loss
class LossCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if "loss" in logs:
            self.losses.append(logs["loss"])

# Data preprocessing function
def process_data(tokenizer):
    dataset = load_dataset("json", data_files=data_path, split="train[:1500]")

    def format_example(example):
        instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
        inputs = tokenizer(
            f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",
            padding="max_length",
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        return {"input_ids": inputs["input_ids"].squeeze(0),
                "attention_mask": inputs["attention_mask"].squeeze(0)}

    return dataset.map(format_example, remove_columns=dataset.column_names)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training argument configuration
training_args = TrainingArguments(
    output_dir=output_path,
    per_device_train_batch_size=2,    # small batch size to save GPU memory
    gradient_accumulation_steps=4,    # gradient accumulation, equivalent to batch_size=8
    num_train_epochs=3,
    learning_rate=3e-4,
    fp16=True,                        # enable mixed precision
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    no_cuda=False,                    # force the use of CUDA
    dataloader_pin_memory=False,      # disable pinned memory
    remove_unused_columns=False       # keep unused columns
)

def main():
    # Create the output directory
    os.makedirs(output_path, exist_ok=True)
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token
    # Load the model onto the GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map={"": device}    # force the use of a specific GPU
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    # Prepare the data
    dataset = process_data(tokenizer)
    # Training callback
    loss_callback = LossCallback()

    # Data collator (builds batches)
    def data_collator(data):
        batch = {
            "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
            "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
            "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # use input_ids as labels
        }
        return batch

    # Create the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
        callbacks=[loss_callback]
    )
    # Start training
    print("Start training...")
    trainer.train()
    # Save the final model
    trainer.model.save_pretrained(output_path)
    print(f"The model has been saved to: {output_path}")
    # Plot the training loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(loss_callback.losses)
    plt.title("Training Loss Curve")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.savefig(os.path.join(output_path, "loss_curve.png"))
    print("Loss curve has been saved")

if __name__ == "__main__":
    main()
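After training finishes, a quick sanity check of the saved LoRA adapter can be done along these lines (a minimal sketch, not part of the original script; it assumes the same model_path and output_path as above):
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
base = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map={"": "cuda"})
model = PeftModel.from_pretrained(base, output_path)    # load the adapter saved by save_pretrained
model.eval()

prompt = "Diagnose the problem: What should I do if I have a fever?\nDetailed analysis:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))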
(3) Detailed explanation of the code
1. Import necessary libraries and modules
import torch
import matplotlib.pyplot as plt
from transformers import (        # HuggingFace Transformers model tooling
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TrainerCallback
)
from peft import LoraConfig, get_peft_model   # parameter-efficient fine-tuning library
from datasets import load_dataset             # dataset loading tool
import os                                     # filesystem path operations
About the libraries:
1. torch (the core PyTorch module)
Function: a deep learning framework that provides tensor computation and neural network building blocks.
Role in the code:
Manages the GPU device (torch.cuda.is_available() checks GPU availability)
Provides the tensor operations used during model training
Controls mixed precision training (torch.float16)
2. matplotlib.pyplot (the Matplotlib plotting library)
Function: data visualization toolkit.
Role in the code:
Plots the training loss curve (plt.plot(losses))
Generates and saves the loss plot for the training run (loss_curve.png)
3. transformers (the HuggingFace Transformers library)
Core components:
AutoTokenizer: automatically loads the tokenizer that matches the pre-trained model
Used to convert text into a sequence of token IDs the model can understand
AutoModelForCausalLM: automatically loads a causal language model (such as the GPT family)
Provides the underlying large language model structure
TrainingArguments: defines the training hyperparameters
Controls batch size, learning rate, logging frequency, etc.
Trainer: class that encapsulates the training process
Automatically handles the training loop, gradient updates, logging, etc.
TrainerCallback: base class for training callbacks
Used to implement custom training monitoring logic (such as the loss logging in this example)
4. peft (Parameter-Efficient Fine-Tuning)
Function: a library implementing parameter-efficient fine-tuning methods.
Core components:
LoraConfig: configuration class for LoRA (Low-Rank Adaptation)
Defines key parameters such as the rank (r) and the target modules (target_modules)
get_peft_model: wraps the base model as a PEFT model
Only about 0.1% of the original model parameters need to be trained to achieve effective fine-tuning
Role in the code:
Lightweight fine-tuning of large models such as LLaMA
GPU memory usage is reduced by roughly 60-70%, which makes training feasible on consumer-grade GPUs
5. datasets (the HuggingFace Datasets library)
Function: efficient dataset loading and processing tool.
Core methods:
load_dataset: loads data in many formats
Supports JSON/CSV/Parquet and other formats (JSON is used in this example)
map: data preprocessing pipeline
Applies the custom formatting function (format_example)
Role in the code:
Loads the medical question answering dataset from a local file
Converts the raw data into the input format required by the model
6. os (operating system interface)
Function: provides operating-system-related utilities.
Role in the code:
Creates the output directory (os.makedirs)
Handles file-path operations
Ensures the model save path is valid
2. Configure the paths and check the hardware
# Configure the paths (modify according to your actual paths)
model_path = r"your model path"                     # storage path of the pre-trained model
data_path = r"your dataset path"                    # training data path (JSON format)
output_path = r"your fine-tuned model save path"    # save location for the fine-tuned model
# Force the use of a GPU (make sure CUDA is available)
assert torch.cuda.is_available(), "Must use GPU for training!"
device = torch.device("cuda")    # use the CUDA device
3. Custom training callback class
class LossCallback(TrainerCallback):
    def __init__(self):
        self.losses = []    # list storing the loss values

    # Triggered whenever a log entry is emitted during training
    def on_log(self, args, state, control, logs=None, **kwargs):
        if "loss" in logs:    # filter for and record the loss value
            self.losses.append(logs["loss"])
4. Data preprocessing function
def process_data(tokenizer):
    # Load the dataset from the JSON file (only the first 1500 records)
    dataset = load_dataset("json", data_files=data_path, split="train[:1500]")

    # Format a single record
    def format_example(example):
        # Splice the instruction and the answer together (fixed template)
        instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
        inputs = tokenizer(
            f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",   # append the end-of-text token
            padding="max_length",     # pad to the maximum length
            truncation=True,          # truncate sequences that are too long
            max_length=512,           # maximum sequence length
            return_tensors="pt"       # return PyTorch tensors
        )
        # Return the processed inputs (drop the batch dimension)
        return {"input_ids": inputs["input_ids"].squeeze(0),
                "attention_mask": inputs["attention_mask"].squeeze(0)}

    # Apply the formatting function and remove the original columns
    return dataset.map(format_example, remove_columns=dataset.column_names)
Key code
1. Combine the instruction and the answer
Function: combines the question (Question) and the detailed analysis (Complex_CoT) into a single instruction.
Example:
Input: Question="What should I do if I have a fever?", Complex_CoT="It may be caused by a cold."
Output: "Diagnose the problem: What should I do if I have a fever?\nDetailed analysis: It may be caused by a cold."
Analogy: it is like writing the question and the analysis on a single piece of paper.
instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
2. Use the tokenizer to process the text
Function: converts the concatenated text into a format the model can understand.
Parameter description:
padding="max_length": pads the text to a fixed length (512).
truncation=True: text longer than 512 tokens is truncated.
max_length=512: the maximum length is 512.
return_tensors="pt": returns PyTorch tensors.
Example:
Input: "Diagnose the problem: What should I do if I have a fever?\nDetailed analysis: It may be caused by a cold.\n### Answer:\nDrink more water and rest."
Output: input_ids=[101, 234, 345, ..., 102], attention_mask=[1, 1, 1, ..., 1]
Analogy: it is like translating text into numbers a machine can understand.
inputs = tokenizer(
    f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",   # append the end-of-text token
    padding="max_length",     # pad to the maximum length
    truncation=True,          # truncate over-length text
    max_length=512,           # maximum sequence length
    return_tensors="pt"       # return PyTorch tensors
)
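To see concretely what padding and truncation produce, a quick check like the following can help (a small sketch; it assumes tokenizer has already been loaded from model_path as in the main script):
sample = tokenizer(
    "Diagnose the problem: What should I do if I have a fever?\n### Answer:\nDrink more water and rest.",
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(sample["input_ids"].shape)             # torch.Size([1, 512]) -- always padded to 512
print(int(sample["attention_mask"].sum()))   # number of real (non-padding) tokens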
3. Return the processed inputs
Function: returns the processed input data with the extra batch dimension removed.
Parameter description:
input_ids: the token ID sequence corresponding to the text.
attention_mask: marks which positions are real tokens (1 for a real token, 0 for padding).
Analogy: it is like arranging the translated numbers into a table.
return {"input_ids": inputs["input_ids"].squeeze(0),
        "attention_mask": inputs["attention_mask"].squeeze(0)}
4. Apply the formatting function
Function: applies the formatting function to the entire dataset and removes the original columns.
Parameter description:
format_example: the formatting function.
remove_columns=dataset.column_names: removes the original columns (such as Question, Complex_CoT, etc.).
Analogy: it is like translating every page of a whole book into a format a machine can understand.
return dataset.map(format_example, remove_columns=dataset.column_names)
5. LoRA fine-tuning configuration
peft_config = LoraConfig(
    r=16,                                 # LoRA rank (dimension of the low-rank decomposition)
    lora_alpha=32,                        # scaling factor (controls how strongly the adapter influences the model)
    target_modules=["q_proj", "v_proj"],  # attention modules to adapt (query/value projections)
    lora_dropout=0.05,                    # dropout rate to prevent overfitting
    bias="none",                          # do not train bias parameters
    task_type="CAUSAL_LM"                 # task type (causal language model)
)
1. r=16: the LoRA rank
Function: controls the dimension of the low-rank matrices. The smaller the rank, the fewer parameters and the less computation.
Explanation:
The rank (r) is the decomposition dimension of the low-rank matrices and determines their size.
For example, r=16 means the low-rank matrices have an inner dimension of 16.
Impact:
A smaller r reduces the number of parameters but may reduce model quality.
A larger r increases the number of parameters but may improve model quality.
Default value: usually set to 8 or 16, but bigger is not automatically better. Choosing the LoRA rank is a trade-off between adaptability and computational efficiency: a larger rank provides stronger expressive power, but increases computation and GPU memory usage and may lead to overfitting. For simple tasks a smaller rank (such as 4 or 8) is usually recommended, while complex tasks may require a higher rank (such as 16 or 32).
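For a rough sense of scale (a hypothetical example, assuming a single 4096 x 4096 attention projection, which is typical of 7B-class models; the real dimensions depend on the model you load):
d, r = 4096, 16                           # hypothetical hidden size and the LoRA rank used above
full = d * d                              # parameters updated by fully fine-tuning this one projection
lora = d * r + r * d                      # parameters in the LoRA A and B matrices for the same projection
print(full, lora, f"{lora / full:.2%}")   # 16777216 131072 0.78%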
2. lora_alpha=32: the scaling factor
Function: controls how strongly the low-rank matrices influence the original model.
Explanation:
lora_alpha is a scaling factor applied to the output of the low-rank matrices.
Specifically, the output of the low-rank matrices is multiplied by lora_alpha / r.
Impact:
A larger lora_alpha makes the influence of the low-rank matrices stronger.
A smaller lora_alpha makes the influence of the low-rank matrices weaker.
Default: typically set to 32. With the values used here, the scaling is lora_alpha / r = 32 / 16 = 2.
3. target_modules=["q_proj", "v_proj"]: the target modules
Function: specifies which model modules the low-rank matrices are inserted into.
Explanation:
q_proj and v_proj are modules of the attention mechanism in Transformer models:
q_proj: the query projection matrix.
v_proj: the value projection matrix.
LoRA inserts low-rank matrices into these two modules.
Impact:
Choosing different modules changes the effect of fine-tuning.
q_proj and v_proj are usually chosen because they have a large influence on model behavior.
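If you are unsure which module names a given model exposes, you can list them before building the LoraConfig (a sketch; it assumes model is the base AutoModelForCausalLM loaded as in main()):
proj_names = {name.split(".")[-1] for name, _ in model.named_modules()}
print(sorted(n for n in proj_names if n.endswith("_proj")))
# LLaMA-style models typically list q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj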
4. lora_dropout=0.05: the dropout rate
Function: helps prevent overfitting.
Explanation:
Dropout is a regularization technique that randomly drops some units so the model does not over-rely on particular features.
lora_dropout=0.05 means that during training 5% of the activations in the LoRA path are randomly dropped.
Impact:
A larger dropout rate increases the robustness of the model but may reduce training efficiency.
A smaller dropout rate weakens the regularization effect but may speed up training.
5. bias="none": the bias parameters
Function: controls whether the bias parameters are trained. Bias parameters provide a baseline offset for the model's outputs so that it can fit the data better.
Explanation:
bias="none" means the bias parameters are not trained.
Other options are "all" (train all bias parameters) and "lora_only" (train only the LoRA-related bias parameters).
Impact:
Not training the bias parameters reduces the number of trainable parameters but may slightly affect model quality.
6. task_type="CAUSAL_LM": the task type
Function: specifies the task type.
Explanation:
CAUSAL_LM stands for causal language modelling, a generative task (as in GPT-style models).
Other task types include sequence classification (SEQ_CLS) and sequence-to-sequence (SEQ_2_SEQ_LM).
Impact:
Different task types change how LoRA is applied.
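For reference, peft also exposes these task types as an enum, so the string used above can equivalently be written with TaskType (a small sketch, identical in effect to the configuration shown earlier):
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,    # equivalent to task_type="CAUSAL_LM"
)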
6. Training parameter configuration
training_args = TrainingArguments(
    output_dir=output_path,           # output directory (model/logs)
    per_device_train_batch_size=2,    # per-GPU batch size (GPU memory optimization)
    gradient_accumulation_steps=4,    # gradient accumulation steps (equivalent to batch_size=8)
    num_train_epochs=3,               # number of training epochs
    learning_rate=3e-4,               # initial learning rate
    fp16=True,                        # enable mixed precision training (saves GPU memory)
    logging_steps=20,                 # log every 20 steps
    save_strategy="no",               # do not save intermediate checkpoints
    report_to="none",                 # disable third-party reporting (such as W&B)
    optim="adamw_torch",              # optimizer type
    no_cuda=False,                    # force the use of CUDA
    dataloader_pin_memory=False,      # disable pinned memory
    remove_unused_columns=False       # keep unused columns (avoids data errors)
)
1. output_dir=output_path: the output directory
Function: specifies where the model and logs are saved during training. output_path was defined in the path configuration at the top of the script.
Explanation:
Model checkpoints, log files, and other artifacts generated during training are saved in this directory.
Example:
If output_path = "./output", all files are saved in the ./output directory.
2. per_device_train_batch_size=2: per-GPU batch size
Function: sets the training batch size on each GPU.
Explanation:
The batch size is the number of samples fed into the model at once.
A smaller batch size saves GPU memory but may slow down training.
Example:
With a single GPU, 2 samples are fed into the model per training step.
3. gradient_accumulation_steps=4: gradient accumulation steps
Function: accumulates gradients over several batches before each optimizer update.
Explanation:
Gradients from 4 consecutive batches are summed before the parameters are updated, so the effective batch size is 2 x 4 = 8 while only one small batch has to fit in GPU memory at a time.
4. num_train_epochs=3: number of training epochs
Function: sets how many times the model is trained over the entire dataset.
Explanation:
One epoch means the model passes over the training dataset once in full.
It is set to 3 here, so the model is trained for 3 epochs.
Example:
If the dataset has 1000 records, the model traverses those 1000 records 3 times.
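Putting the batch size, gradient accumulation, and epoch count together gives a rough estimate of the total number of optimizer steps (the exact figure depends on how the Trainer handles the final partial accumulation, so treat this as an approximation):
samples, per_device_bs, grad_accum, epochs = 1500, 2, 4, 3
batches_per_epoch = samples // per_device_bs                  # 750 forward/backward passes per epoch
optim_steps_per_epoch = -(-batches_per_epoch // grad_accum)   # ceil(750 / 4) = 188 optimizer steps
print(optim_steps_per_epoch * epochs)                         # roughly 564 optimizer steps in total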
5. learning_rate=3e-4: initial learning rate
Function: the step size the optimizer uses when updating the trainable (LoRA) parameters at the start of training.
6. fp16=True: mixed precision training
Function: enables mixed precision training, which saves GPU memory and speeds up training.
Explanation:
Mixed precision training uses both 16-bit (half-precision) and 32-bit (single-precision) floating-point numbers.
16-bit floating-point numbers take up less GPU memory and are faster to compute with.
Example:
If GPU memory is tight, enabling fp16 can significantly reduce memory usage.
7. logging_steps=20: logging frequency
Function: sets how many steps pass between log entries.
Explanation:
Each log entry includes information such as the loss value and the learning rate.
It is set to 20 here, so a log entry is recorded every 20 steps.
Example:
If training runs for 1000 steps in total, 50 log entries are recorded (1000 / 20 = 50).
8. save_strategy="no": the save strategy
Function: sets whether intermediate checkpoints are saved.
Explanation:
"no" means no intermediate checkpoints are saved.
Other options are "epoch" (save once per epoch) and "steps" (save every fixed number of steps).
Example:
If set to "epoch", the model is saved after every training epoch.
9. report_to="none": disable third-party reporting
Function: disables third-party logging tools (such as Weights & Biases).
Explanation:
If you do not need a third-party tool to track training logs, set this to "none".
Example:
If set to "wandb", the logs are synchronized to the Weights & Biases platform.
10. optim="adamw_torch": the optimizer type
Function: specifies the optimizer.
Explanation:
adamw_torch is a commonly used optimizer that combines Adam with weight decay.
It is suitable for most deep learning tasks.
Example:
If training is unstable, you can try another optimizer such as SGD (stochastic gradient descent), which updates the parameters along the gradient of the loss function to minimize it.
11. no_cuda=False: force the use of CUDA
Function: forces training to run on the GPU.
Explanation:
no_cuda=False means the GPU is used.
If set to True, the CPU is used instead (not recommended).
Example:
If a GPU is available, the model is automatically trained on it.
12. dataloader_pin_memory=False: disable pinned memory
Function: sets whether pinned (page-locked) memory is used to speed up data loading.
Explanation:
Pinned memory can speed up host-to-GPU data transfer but uses more host memory.
Setting this to False disables pinned memory.
Example:
If the host has plenty of memory, this can be set to True to speed up training.
13. remove_unused_columns=False: keep unused columns
Function: sets whether columns in the dataset that the model does not use are removed.
Explanation:
If set to True, columns not consumed by the model are removed.
If set to False, all columns are kept.
Example:
If your dataset contains extra information (such as IDs), you can keep those columns.
7. Main function (training process)
def main():
    # Create the output directory (if it does not exist)
    os.makedirs(output_path, exist_ok=True)
    # Load the tokenizer and set the padding token
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token    # use EOS as the padding token
    # Load the pre-trained model (half precision, on the chosen GPU)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,    # load in half precision (saves GPU memory)
        device_map={"": device}       # put the model on the specified GPU
    )
    # Apply the LoRA adapter
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()    # print the number of trainable parameters
    # Prepare the training dataset
    dataset = process_data(tokenizer)
    # Initialize the loss-recording callback
    loss_callback = LossCallback()

    # Data collator (builds batches)
    def data_collator(data):
        batch = {
            "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
            "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
            "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # labels = inputs (causal LM task)
        }
        return batch

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,    # custom data collation
        callbacks=[loss_callback]       # add the callback
    )
    # Run training
    print("Start training...")
    trainer.train()
    # Save the fine-tuned model
    trainer.model.save_pretrained(output_path)
    print(f"The model has been saved to: {output_path}")
    # Plot the loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(loss_callback.losses)
    plt.title("Training Loss Curve")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.savefig(os.path.join(output_path, "loss_curve.png"))    # save as PNG
    print("Loss curve has been saved")

if __name__ == "__main__":
    main()
Key code:
1. Load the tokenizer and set the padding token
Function: load the tokenizer of the pre-trained model and set the padding token.
Explanation:
AutoTokenizer.from_pretrained: automatically loads a tokenizer that matches the model.
tokenizer.pad_token = tokenizer.eos_token: use the end-of-sequence token (EOS) as the padding token.
Example:
If an input sequence is too short, it is padded with EOS tokens.
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token    # use EOS as the padding token
2. Load the pre-trained model
Function: load the pre-trained language model and configure the hardware-related settings.
Explanation:
AutoModelForCausalLM.from_pretrained: loads a causal language model (such as GPT-style models).
torch_dtype=torch.float16: loads the model in half precision (16-bit floating point) to save GPU memory.
device_map={"": device}: loads the model onto the specified GPU device.
Example:
If device = "cuda:0", the model is loaded onto the first GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,    # load in half precision (saves GPU memory)
    device_map={"": device}       # put the model on the specified GPU
)
3. Data collator
Function: collate several records into a single batch.
Explanation:
input_ids: the token IDs of the input sequences.
attention_mask: marks the positions of real (non-padding) tokens.
labels: for a causal language model the labels are the same as the inputs (the model predicts the next token).
Example:
If the inputs are ["Diagnose the problem: What should I do if I have a fever?", "Diagnose the problem: What should I do if I have a headache?"], they are collated into one batch.
def data_collator(data):
    batch = {
        "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
        "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
        "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # labels = inputs (causal LM task)
    }
    return batch
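One common refinement, not used in the code above, is to mask the padding positions in labels with -100 so that padding does not contribute to the loss (HuggingFace causal LM models treat -100 as the ignore index). A sketch of such a modified collator:
def data_collator(data):
    input_ids = torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)
    attention_mask = torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device)
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100    # padding tokens no longer contribute to the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}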
4. Initialize the Trainer
Function: create the trainer object that manages the training process.
Explanation:
model: the model to train.
args: the training arguments (batch size, learning rate, etc.).
train_dataset: the training dataset.
data_collator: the custom data collation function.
callbacks: training callbacks (e.g. the loss logging callback).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,    # custom data collation
    callbacks=[loss_callback]       # add the callback
)
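As an aside, the loss values collected by LossCallback are also available after training in trainer.state.log_history, so the loss curve could be rebuilt without a custom callback (a sketch, to be run after trainer.train(); the output filename is illustrative):
losses = [entry["loss"] for entry in trainer.state.log_history if "loss" in entry]
plt.figure(figsize=(10, 6))
plt.plot(losses)
plt.savefig(os.path.join(output_path, "loss_curve_from_log_history.png"))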
4. Closing remarks