Getting started from scratch: DeepSeek fine-tuning evaluation tutorial is here!

Even without any prior background, you can master large-model fine-tuning and make AI models understand you better!
Core content:
1. An intuitive look at what fine-tuning does to a large model
2. Fine-tuning the DeepSeek-R1-Distill-Qwen-7B model
3. Reproducing the fine-tuning tutorial, with a code walkthrough
Preface: Large model evaluation is a systematic undertaking. This article aims to give you, in a relatively accessible way, an intuitive feel for what fine-tuning does to a large model. The ideas here are meant as a starting point for discussion; learners with a deeper interest in large model evaluation can explore it from other angles.
Three days ago, I saw an article published on our Datawhale official account titled "Zero-based entry: DeepSeek fine-tuning tutorial is here!" The response was very good: the content was approachable and well suited for learners to get hands-on experience.
So I tried to reproduce it and extended the content a little, to help readers feel more directly the adjustments that fine-tuning makes to a model.
For ease of learning, the model selected in this article is the distilled DeepSeek-R1-Distill-Qwen-7B, and the GPU is an RTX 4090 with 24 GB of memory.
Both the DeepSeek model and the Medical-o1-reasoning-SFT dataset are downloaded from the ModelScope community.
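For reference, a minimal download sketch using the ModelScope SDK might look like the following; the model ID in it is an assumption based on the model name above, so check the ModelScope page for the exact ID (and download the dataset JSON the same way) before running.

# Minimal download sketch (pip install modelscope).
# The model ID below is assumed from the model name; verify it on ModelScope first.
from modelscope import snapshot_download

model_dir = snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed ModelScope model ID
    cache_dir="./models"                        # local directory, used later as model_path
)
print(model_dir)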
1. Reproducing the fine-tuning tutorial
import torch
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TrainerCallback
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # specify which GPU to use

# Path configuration (modify according to your actual paths)
model_path = "xxxx"   # model path
data_path = "xxxx"    # dataset path
output_path = "xxxx"  # save path for the fine-tuned model

# Device settings
DEVICE = "cuda"   # use CUDA
DEVICE_ID = "0"   # CUDA device ID, empty if not set
device = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE  # combined CUDA device string

# Custom callback to record the training loss
class LossCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "loss" in logs:
            self.losses.append(logs["loss"])

# Data preprocessing function
def process_data(tokenizer):
    dataset = load_dataset("json", data_files=data_path, split="train[:1500]")

    def format_example(example):
        instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
        inputs = tokenizer(
            f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",
            padding="max_length",
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        return {
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

    return dataset.map(format_example, remove_columns=dataset.column_names)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training arguments
training_args = TrainingArguments(
    output_dir=output_path,
    per_device_train_batch_size=2,   # memory-friendly setting
    gradient_accumulation_steps=4,   # gradient accumulation, effective batch_size = 8
    num_train_epochs=3,
    learning_rate=3e-4,
    fp16=True,                       # enable mixed precision
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    no_cuda=False,                   # force the use of CUDA
    dataloader_pin_memory=False,     # pinned memory disabled (set True to speed up host-to-GPU transfer)
    remove_unused_columns=False      # keep the columns the data collator needs
)

def main():
    # Create the output directory
    os.makedirs(output_path, exist_ok=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # Load the model onto the GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map=device
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # Prepare the data
    dataset = process_data(tokenizer)

    # Training callback
    loss_callback = LossCallback()

    # Data collator
    def data_collator(data):
        batch = {
            "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
            "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
            "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # use input_ids as labels
        }
        return batch

    # Create the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
        callbacks=[loss_callback]
    )

    # Start training
    print("Start training...")
    trainer.train()

    # Save the final model
    trainer.model.save_pretrained(output_path)
    print(f"The model has been saved to: {output_path}")

    # Plot the training loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(loss_callback.losses)
    plt.title("Training Loss Curve")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.savefig(os.path.join(output_path, "loss_curve.png"))
    print("Loss curve has been saved")

if __name__ == "__main__":
    main()
From the loss curve, we can see that even this simple fine-tuning run steadily reduces the training loss, indicating that the DeepSeek model is fitting the training data well.
2. Visually comparing model outputs
After fine-tuning, how does the generated content change, and how can we compare it?
The first idea that comes to mind is to directly compare the answers that the fine-tuned model and the original model generate for the same question.
So we can use the same prompt template and the same questions, and then compare the generated answers.
The specific code is as follows:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import os
import json
from bert_score import score
from tqdm import tqdm

# Set the visible GPU device (adjust to your actual GPU setup)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only GPU 0

# Path configuration ------------------------------------------------------------
base_model_path = "xxxxx"  # original pre-trained model path
peft_model_path = "xxxxx"  # LoRA adapter path saved after fine-tuning

# Model loading -----------------------------------------------------------------
# Initialize the tokenizer (the same tokenizer used during training)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Load the base model (half precision saves GPU memory)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,  # float16 precision
    device_map="auto"           # automatically place layers on CPU/GPU
)

# Load the LoRA adapter (fine-tuned parameters on top of the base model)
lora_model = PeftModel.from_pretrained(
    base_model,
    peft_model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Merge the LoRA weights into the base model (faster inference, but the adapter can no longer be trained)
lora_model = lora_model.merge_and_unload()
lora_model.eval()  # evaluation mode

# Generation function -------------------------------------------------------------
def generate_response(model, prompt):
    """Unified generation function.

    Args:
        model: the model instance to use
        prompt: input text in the expected format
    Returns:
        The cleaned answer text.
    """
    # Encode the input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",   # return PyTorch tensors
        max_length=1024,       # maximum input length
        truncation=True,       # enable truncation
        padding="max_length"   # pad to max length (keeps batches consistent)
    ).to(model.device)         # keep input and model on the same device

    # Generate text (disable gradients to save memory)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=1024,                  # maximum number of new tokens (controls answer length)
            temperature=0.7,                      # temperature (higher values mean more randomness)
            top_p=0.9,                            # nucleus sampling (keep tokens within 90% cumulative probability)
            repetition_penalty=1.1,               # repetition penalty (> 1.0 suppresses repeated content)
            eos_token_id=tokenizer.eos_token_id,  # end-of-sequence token ID
            pad_token_id=tokenizer.pad_token_id,  # padding token ID
        )

    # Decode and clean the output
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # skip special tokens
    answer = full_text.split("### Answer:\n")[-1].strip()  # extract the answer part
    return answer

# Comparison function --------------------------------------------------------------
def compare_models(question):
    """Compare the base model and the LoRA model on one question.

    Args:
        question: a medical question in natural language
    """
    # Build a prompt in the same format as during training
    prompt = f"Diagnose the problem: {question}\nDetailed analysis: \n### Answer:\n"

    # Generate with both models
    base_answer = generate_response(base_model, prompt)  # original model
    lora_answer = generate_response(lora_model, prompt)  # fine-tuned model

    # Print a colored comparison in the terminal
    print("\n" + "=" * 50)  # separator line
    print(f"Question: {question}")
    print("-" * 50)
    print(f"\033[1;34m[original model]\033[0m\n{base_answer}")  # blue: original model
    print("-" * 50)
    print(f"\033[1;32m[LoRA model]\033[0m\n{lora_answer}")      # green: fine-tuned model
    print("=" * 50 + "\n")

# Main program ----------------------------------------------------------------------
if __name__ == "__main__":
    # Test questions (freely expandable)
    test_questions = [
        "According to the description, a one-year-old child developed multiple small nodules on his scalp in the summer, which did not heal for a long time. Now the sores are as big as plums, ulcerating and oozing pus, and the mouths do not close. There are cavities under the scalp, and the skin of the affected area is thickened. What is the diagnosis of this disease in traditional Chinese medicine?"
    ]
    # Iterate over the test questions
    for q in test_questions:
        compare_models(q)
Judging from the generated content, the LoRA fine-tuned model does differ somewhat from the original model, but comparing the answers this way is quite subjective. After all, as learners we may not know much about the medical domain. Can we find a more intuitive way to show the difference between the fine-tuned model and the original model?
At this point we wondered whether we could evaluate by text similarity and use BERTScore to compare the models. What is BERTScore? Here is the reply the full DeepSeek model gave me; the output is too long to paste in full, but the gist is that BERTScore measures semantic similarity between two texts. So it seems we can use BERTScore to compare the dataset's reference answers with the answers each model generates, and thereby see more directly how the fine-tuned model differs from the original model.
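For readers who have not used the library before, here is a tiny, self-contained sketch of calling bert_score on made-up sentences; it only shows that score() returns per-pair precision, recall, and F1 tensors. With lang="en" it downloads a default English scoring model when online, whereas the full script below points model_type at an offline bert-base-chinese instead.

# Toy BERTScore example: compare two candidate sentences against references.
# Requires `pip install bert-score`; the sentences are made up for illustration.
from bert_score import score

candidates = ["The patient likely has a common cold.", "Drink water and rest."]
references = ["The patient probably has a common cold.", "The patient should exercise more."]

P, R, F1 = score(candidates, references, lang="en")  # per-pair tensors
for cand, f1 in zip(candidates, F1.tolist()):
    print(f"F1={f1:.3f}  |  {cand}")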
To keep things simple, the BERT model used in the code below is the basic bert-base-chinese model, which can also be downloaded from the ModelScope community.
Note that, since some learners may not be able to access the Hugging Face website, the bert-base-chinese model here is loaded from a local (offline) path.
Also note that model evaluation is quite resource-intensive, so it is recommended that learners evaluate only 10 samples from the dataset.
Ok, let's take a look at the code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import os
import json
from bert_score import score
from tqdm import tqdm

# Set the visible GPU device (adjust to your actual GPU setup)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only GPU 0

# Path configuration ------------------------------------------------------------
base_model_path = "xxxxxx/DeepSeek-R1-Distill-Qwen-7B"  # original pre-trained model path
peft_model_path = "xxxxxx/output"                       # LoRA adapter path saved after fine-tuning

# Model loading -----------------------------------------------------------------
# Initialize the tokenizer (the same tokenizer used during training)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Load the base model (half precision saves GPU memory)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,  # float16 precision
    device_map="auto"           # automatically place layers on CPU/GPU
)

# Load the LoRA adapter (fine-tuned parameters on top of the base model)
lora_model = PeftModel.from_pretrained(
    base_model,
    peft_model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Merge the LoRA weights into the base model (faster inference, but the adapter can no longer be trained)
lora_model = lora_model.merge_and_unload()
lora_model.eval()  # evaluation mode

# Generation function -------------------------------------------------------------
def generate_response(model, prompt):
    """Unified generation function.

    Args:
        model: the model instance to use
        prompt: input text in the expected format
    Returns:
        The cleaned answer text.
    """
    # Encode the input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",   # return PyTorch tensors
        max_length=1024,       # maximum input length
        truncation=True,       # enable truncation
        padding="max_length"   # pad to max length (keeps batches consistent)
    ).to(model.device)         # keep input and model on the same device

    # Generate text (disable gradients to save memory)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=1024,                  # maximum number of new tokens (controls answer length)
            temperature=0.7,                      # temperature (higher values mean more randomness)
            top_p=0.9,                            # nucleus sampling (keep tokens within 90% cumulative probability)
            repetition_penalty=1.1,               # repetition penalty (> 1.0 suppresses repeated content)
            eos_token_id=tokenizer.eos_token_id,  # end-of-sequence token ID
            pad_token_id=tokenizer.pad_token_id,  # padding token ID
        )

    # Decode and clean the output
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # skip special tokens
    answer = full_text.split("### Answer:\n")[-1].strip()  # extract the answer part
    return answer

# Comparison function --------------------------------------------------------------
def compare_models(question):
    """Compare the base model and the LoRA model on one question.

    Args:
        question: a medical question in natural language (e.g. "What should I do if my child has a cold?")
    """
    # Build a prompt in the same format as during training
    prompt = f"Diagnose the problem: {question}\nDetailed analysis: \n### Answer:\n"

    # Generate with both models
    base_answer = generate_response(base_model, prompt)  # original model
    lora_answer = generate_response(lora_model, prompt)  # fine-tuned model

    # Print a colored comparison in the terminal
    print("\n" + "=" * 50)  # separator line
    print(f"Question: {question}")
    print("-" * 50)
    print(f"\033[1;34m[original model]\033[0m\n{base_answer}")  # blue: original model
    print("-" * 50)
    print(f"\033[1;32m[LoRA model]\033[0m\n{lora_answer}")      # green: fine-tuned model
    print("=" * 50 + "\n")

# Main program ----------------------------------------------------------------------
if __name__ == "__main__":
    # Single-question test from the previous section (kept for reference, freely expandable)
    # test_questions = [
    #     "According to the description, a one-year-old child developed multiple small nodules on his scalp in the summer, which did not heal for a long time. Now the sores are as big as plums, ulcerating and oozing pus, and the mouths do not close. There are cavities under the scalp, and the skin of the affected area is thickened. What is the diagnosis of this disease in traditional Chinese medicine?"
    # ]
    # for q in test_questions:
    #     compare_models(q)

    # ----------- Batch test ----------- #
    # Load the test data
    with open("xxxxxx/data/medical_o1_sft_Chinese.json", encoding="utf-8") as f:
        test_data = json.load(f)
    # The dataset is fairly large, so only 10 samples are used for testing
    test_data = test_data[:10]

    # Generate answers in batches
    def batch_generate(model, questions):
        answers = []
        for q in tqdm(questions):
            prompt = f"Diagnose the problem: {q}\nDetailed analysis: \n### Answer:\n"
            ans = generate_response(model, prompt)
            answers.append(ans)
        return answers

    # Generate results
    base_answers = batch_generate(base_model, [d["Question"] for d in test_data])
    lora_answers = batch_generate(lora_model, [d["Question"] for d in test_data])
    ref_answers = [d["Response"] for d in test_data]

    bert_model_path = "xxxxx/model/bert-base-chinese"
    # Compute BERTScore (offline bert-base-chinese as the scoring model)
    _, _, base_bert = score(base_answers, ref_answers, lang="zh", model_type=bert_model_path, num_layers=12, device="cuda")
    _, _, lora_bert = score(lora_answers, ref_answers, lang="zh", model_type=bert_model_path, num_layers=12, device="cuda")
    print(f"BERTScore | original model: {base_bert.mean().item():.3f} | LoRA model: {lora_bert.mean().item():.3f}")
Let’s look at the results:
As we can see, by using BERTScore to compare the dataset's reference answers with the answers generated by each model, there is already a slight difference between the LoRA fine-tuned model and the original model. As the number of LoRA training epochs increases, or if we deliberately push the model toward "overfitting" the training set, the gap in scores should widen further. This gives learners a new, relatively quantitative perspective on the effect of fine-tuning.
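If you want to try the "deliberate overfitting" experiment mentioned above, a minimal sketch is to reuse the training script from Section 1 and change only a few hyperparameters, for example raising the number of epochs and removing LoRA dropout; the values below are illustrative assumptions, not tuned settings.

# Hypothetical "overfit on purpose" settings, reusing the Section 1 script:
# more passes over the same 1,500 samples and no LoRA dropout make the model
# memorize the training answers more strongly.
from transformers import TrainingArguments
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,        # no dropout -> easier to memorize the training set
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="output_overfit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,     # many more passes than the original 3 epochs
    learning_rate=3e-4,
    fp16=True,
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    remove_unused_columns=False,
)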
3. Postscript
Evaluating large models is a complex, systematic task, especially in highly specialized fields such as finance and medicine. In real enterprise deployments, there are far more diverse methods for assessing the quality of model outputs.
This article approaches the topic from a beginner's perspective, so that learners can see the difference between a fine-tuned model and the original model in a simple, direct way.