Getting started from scratch: DeepSeek fine-tuning evaluation tutorial is here!

Even without any prior background, you can master large-model fine-tuning and make AI models understand you better!
Core content:
1. An intuitive look at what fine-tuning does to a large model
2. Fine-tuning the DeepSeek-R1-Distill-Qwen-7B model
3. Reproducing the fine-tuning tutorial, with a code walkthrough
Preface: Large model evaluation is a systematic undertaking. This article aims to give you, in a relatively accessible way, an intuitive feel for what fine-tuning does to a large model. The ideas here are meant as a starting point for discussion; learners with a deeper interest in large model evaluation can explore it from other angles.
Three days ago, I saw an article published on our Datawhale official account titled "Zero-based entry: DeepSeek fine-tuning tutorial is here!" The response was very good: the content was approachable and well suited for learners to get hands-on experience.
So I tried to reproduce it and extended the content a little, to help readers feel more directly the adjustments that fine-tuning makes to a model.
For ease of learning, the model selected in this article is the distilled DeepSeek-R1-Distill-Qwen-7B, and the GPU is an RTX 4090 with 24 GB of memory.
Both the DeepSeek model and the Medical-o1-reasoning-SFT dataset are downloaded from the ModelScope community.
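For reference, a minimal download sketch using the ModelScope SDK might look like the following; the model ID in it is an assumption based on the model name above, so check the ModelScope page for the exact ID (and download the dataset JSON the same way) before running.

# Minimal download sketch (pip install modelscope).
# The model ID below is assumed from the model name; verify it on ModelScope first.
from modelscope import snapshot_download

model_dir = snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed ModelScope model ID
    cache_dir="./models"                        # local directory, used later as model_path
)
print(model_dir)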
1. Reproducing the fine-tuning tutorial
import torch
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TrainerCallback
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # specify which GPU to use

# Path configuration (modify according to your actual paths)
model_path = "xxxx"   # model path
data_path = "xxxx"    # dataset path
output_path = "xxxx"  # save path for the fine-tuned model

# Device settings
DEVICE = "cuda"   # use CUDA
DEVICE_ID = "0"   # CUDA device ID, empty if not set
device = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE  # combined CUDA device string

# Custom callback to record the training loss
class LossCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "loss" in logs:
            self.losses.append(logs["loss"])

# Data preprocessing function
def process_data(tokenizer):
    dataset = load_dataset("json", data_files=data_path, split="train[:1500]")

    def format_example(example):
        instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
        inputs = tokenizer(
            f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",
            padding="max_length",
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        return {
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

    return dataset.map(format_example, remove_columns=dataset.column_names)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training arguments
training_args = TrainingArguments(
    output_dir=output_path,
    per_device_train_batch_size=2,   # memory-friendly setting
    gradient_accumulation_steps=4,   # gradient accumulation, effective batch_size = 8
    num_train_epochs=3,
    learning_rate=3e-4,
    fp16=True,                       # enable mixed precision
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    no_cuda=False,                   # force the use of CUDA
    dataloader_pin_memory=False,     # pinned memory disabled (set True to speed up host-to-GPU transfer)
    remove_unused_columns=False      # keep the columns the data collator needs
)

def main():
    # Create the output directory
    os.makedirs(output_path, exist_ok=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # Load the model onto the GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map=device
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # Prepare the data
    dataset = process_data(tokenizer)

    # Training callback
    loss_callback = LossCallback()

    # Data collator
    def data_collator(data):
        batch = {
            "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
            "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
            "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # use input_ids as labels
        }
        return batch

    # Create the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
        callbacks=[loss_callback]
    )

    # Start training
    print("Start training...")
    trainer.train()

    # Save the final model
    trainer.model.save_pretrained(output_path)
    print(f"The model has been saved to: {output_path}")

    # Plot the training loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(loss_callback.losses)
    plt.title("Training Loss Curve")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.savefig(os.path.join(output_path, "loss_curve.png"))
    print("Loss curve has been saved")

if __name__ == "__main__":
    main()
From the loss curve, we can see that even this simple fine-tuning run steadily reduces the training loss, indicating that the DeepSeek model is fitting the training data well.
2. Visually comparing model outputs
After fine-tuning, how does the generated content change, and how can we compare it?
The first idea that comes to mind is to directly compare the answers that the fine-tuned model and the original model generate for the same question.
So we can use the same prompt template and the same questions, and then compare the generated answers.
The specific code is as follows:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import os
import json
from bert_score import score
from tqdm import tqdm

# Set the visible GPU device (adjust to your actual GPU setup)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only GPU 0

# Path configuration ------------------------------------------------------------
base_model_path = "xxxxx"  # original pre-trained model path
peft_model_path = "xxxxx"  # LoRA adapter path saved after fine-tuning

# Model loading -----------------------------------------------------------------
# Initialize the tokenizer (the same tokenizer used during training)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Load the base model (half precision saves GPU memory)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,  # float16 precision
    device_map="auto"           # automatically place layers on CPU/GPU
)

# Load the LoRA adapter (fine-tuned parameters on top of the base model)
lora_model = PeftModel.from_pretrained(
    base_model,
    peft_model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Merge the LoRA weights into the base model (faster inference, but the adapter can no longer be trained)
lora_model = lora_model.merge_and_unload()
lora_model.eval()  # evaluation mode

# Generation function -------------------------------------------------------------
def generate_response(model, prompt):
    """Unified generation function.

    Args:
        model: the model instance to use
        prompt: input text in the expected format
    Returns:
        The cleaned answer text.
    """
    # Encode the input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",   # return PyTorch tensors
        max_length=1024,       # maximum input length
        truncation=True,       # enable truncation
        padding="max_length"   # pad to max length (keeps batches consistent)
    ).to(model.device)         # keep input and model on the same device

    # Generate text (disable gradients to save memory)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=1024,                  # maximum number of new tokens (controls answer length)
            temperature=0.7,                      # temperature (higher values mean more randomness)
            top_p=0.9,                            # nucleus sampling (keep tokens within 90% cumulative probability)
            repetition_penalty=1.1,               # repetition penalty (> 1.0 suppresses repeated content)
            eos_token_id=tokenizer.eos_token_id,  # end-of-sequence token ID
            pad_token_id=tokenizer.pad_token_id,  # padding token ID
        )

    # Decode and clean the output
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # skip special tokens
    answer = full_text.split("### Answer:\n")[-1].strip()  # extract the answer part
    return answer

# Comparison function --------------------------------------------------------------
def compare_models(question):
    """Compare the base model and the LoRA model on one question.

    Args:
        question: a medical question in natural language
    """
    # Build a prompt in the same format as during training
    prompt = f"Diagnose the problem: {question}\nDetailed analysis: \n### Answer:\n"

    # Generate with both models
    base_answer = generate_response(base_model, prompt)  # original model
    lora_answer = generate_response(lora_model, prompt)  # fine-tuned model

    # Print a colored comparison in the terminal
    print("\n" + "=" * 50)  # separator line
    print(f"Question: {question}")
    print("-" * 50)
    print(f"\033[1;34m[original model]\033[0m\n{base_answer}")  # blue: original model
    print("-" * 50)
    print(f"\033[1;32m[LoRA model]\033[0m\n{lora_answer}")      # green: fine-tuned model
    print("=" * 50 + "\n")

# Main program ----------------------------------------------------------------------
if __name__ == "__main__":
    # Test questions (freely expandable)
    test_questions = [
        "According to the description, a one-year-old child developed multiple small nodules on his scalp in the summer, which did not heal for a long time. Now the sores are as big as plums, ulcerating and oozing pus, and the mouths do not close. There are cavities under the scalp, and the skin of the affected area is thickened. What is the diagnosis of this disease in traditional Chinese medicine?"
    ]
    # Iterate over the test questions
    for q in test_questions:
        compare_models(q)
Judging from the generated content, the LoRA fine-tuned model does differ somewhat from the original model, but comparing the answers this way is quite subjective. After all, as learners we may not know much about the medical domain. Can we find a more intuitive way to show the difference between the fine-tuned model and the original model?
At this point we wondered whether we could evaluate by text similarity and use BERTScore to compare the models. What is BERTScore? Here is the reply the full DeepSeek model gave me; the output is too long to paste in full, but the gist is that BERTScore measures semantic similarity between two texts. So it seems we can use BERTScore to compare the dataset's reference answers with the answers each model generates, and thereby see more directly how the fine-tuned model differs from the original model.
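For readers who have not used the library before, here is a tiny, self-contained sketch of calling bert_score on made-up sentences; it only shows that score() returns per-pair precision, recall, and F1 tensors. With lang="en" it downloads a default English scoring model when online, whereas the full script below points model_type at an offline bert-base-chinese instead.

# Toy BERTScore example: compare two candidate sentences against references.
# Requires `pip install bert-score`; the sentences are made up for illustration.
from bert_score import score

candidates = ["The patient likely has a common cold.", "Drink water and rest."]
references = ["The patient probably has a common cold.", "The patient should exercise more."]

P, R, F1 = score(candidates, references, lang="en")  # per-pair tensors
for cand, f1 in zip(candidates, F1.tolist()):
    print(f"F1={f1:.3f}  |  {cand}")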
To keep things simple, the BERT model used in the code below is the basic bert-base-chinese model, which can also be downloaded from the ModelScope community.
Note that, since some learners may not be able to access the Hugging Face website, the bert-base-chinese model here is loaded from a local (offline) path.
Also note that model evaluation is quite resource-intensive, so it is recommended that learners evaluate only 10 samples from the dataset.
Ok, let's take a look at the code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import os
import json
from bert_score import score
from tqdm import tqdm

# Set the visible GPU device (adjust to your actual GPU setup)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only GPU 0

# Path configuration ------------------------------------------------------------
base_model_path = "xxxxxx/DeepSeek-R1-Distill-Qwen-7B"  # original pre-trained model path
peft_model_path = "xxxxxx/output"                       # LoRA adapter path saved after fine-tuning

# Model loading -----------------------------------------------------------------
# Initialize the tokenizer (the same tokenizer used during training)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Load the base model (half precision saves GPU memory)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,  # float16 precision
    device_map="auto"           # automatically place layers on CPU/GPU
)

# Load the LoRA adapter (fine-tuned parameters on top of the base model)
lora_model = PeftModel.from_pretrained(
    base_model,
    peft_model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Merge the LoRA weights into the base model (faster inference, but the adapter can no longer be trained)
lora_model = lora_model.merge_and_unload()
lora_model.eval()  # evaluation mode

# Generation function -------------------------------------------------------------
def generate_response(model, prompt):
    """Unified generation function.

    Args:
        model: the model instance to use
        prompt: input text in the expected format
    Returns:
        The cleaned answer text.
    """
    # Encode the input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",   # return PyTorch tensors
        max_length=1024,       # maximum input length
        truncation=True,       # enable truncation
        padding="max_length"   # pad to max length (keeps batches consistent)
    ).to(model.device)         # keep input and model on the same device

    # Generate text (disable gradients to save memory)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=1024,                  # maximum number of new tokens (controls answer length)
            temperature=0.7,                      # temperature (higher values mean more randomness)
            top_p=0.9,                            # nucleus sampling (keep tokens within 90% cumulative probability)
            repetition_penalty=1.1,               # repetition penalty (> 1.0 suppresses repeated content)
            eos_token_id=tokenizer.eos_token_id,  # end-of-sequence token ID
            pad_token_id=tokenizer.pad_token_id,  # padding token ID
        )

    # Decode and clean the output
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # skip special tokens
    answer = full_text.split("### Answer:\n")[-1].strip()  # extract the answer part
    return answer

# Comparison function --------------------------------------------------------------
def compare_models(question):
    """Compare the base model and the LoRA model on one question.

    Args:
        question: a medical question in natural language (e.g. "What should I do if my child has a cold?")
    """
    # Build a prompt in the same format as during training
    prompt = f"Diagnose the problem: {question}\nDetailed analysis: \n### Answer:\n"

    # Generate with both models
    base_answer = generate_response(base_model, prompt)  # original model
    lora_answer = generate_response(lora_model, prompt)  # fine-tuned model

    # Print a colored comparison in the terminal
    print("\n" + "=" * 50)  # separator line
    print(f"Question: {question}")
    print("-" * 50)
    print(f"\033[1;34m[original model]\033[0m\n{base_answer}")  # blue: original model
    print("-" * 50)
    print(f"\033[1;32m[LoRA model]\033[0m\n{lora_answer}")      # green: fine-tuned model
    print("=" * 50 + "\n")

# Main program ----------------------------------------------------------------------
if __name__ == "__main__":
    # Single-question test from the previous section (kept for reference, freely expandable)
    # test_questions = [
    #     "According to the description, a one-year-old child developed multiple small nodules on his scalp in the summer, which did not heal for a long time. Now the sores are as big as plums, ulcerating and oozing pus, and the mouths do not close. There are cavities under the scalp, and the skin of the affected area is thickened. What is the diagnosis of this disease in traditional Chinese medicine?"
    # ]
    # for q in test_questions:
    #     compare_models(q)

    # ----------- Batch test ----------- #
    # Load the test data
    with open("xxxxxx/data/medical_o1_sft_Chinese.json", encoding="utf-8") as f:
        test_data = json.load(f)
    # The dataset is fairly large, so only 10 samples are used for testing
    test_data = test_data[:10]

    # Generate answers in batches
    def batch_generate(model, questions):
        answers = []
        for q in tqdm(questions):
            prompt = f"Diagnose the problem: {q}\nDetailed analysis: \n### Answer:\n"
            ans = generate_response(model, prompt)
            answers.append(ans)
        return answers

    # Generate results
    base_answers = batch_generate(base_model, [d["Question"] for d in test_data])
    lora_answers = batch_generate(lora_model, [d["Question"] for d in test_data])
    ref_answers = [d["Response"] for d in test_data]

    bert_model_path = "xxxxx/model/bert-base-chinese"
    # Compute BERTScore (offline bert-base-chinese as the scoring model)
    _, _, base_bert = score(base_answers, ref_answers, lang="zh", model_type=bert_model_path, num_layers=12, device="cuda")
    _, _, lora_bert = score(lora_answers, ref_answers, lang="zh", model_type=bert_model_path, num_layers=12, device="cuda")
    print(f"BERTScore | original model: {base_bert.mean().item():.3f} | LoRA model: {lora_bert.mean().item():.3f}")
Let’s look at the results:
As we can see, by using BERTScore to compare the dataset's reference answers with the answers generated by each model, there is already a slight difference between the LoRA fine-tuned model and the original model. As the number of LoRA training epochs increases, or if we deliberately push the model toward "overfitting" the training set, the gap in scores should widen further. This gives learners a new, relatively quantitative perspective on the effect of fine-tuning.
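If you want to try the "deliberate overfitting" experiment mentioned above, a minimal sketch is to reuse the training script from Section 1 and change only a few hyperparameters, for example raising the number of epochs and removing LoRA dropout; the values below are illustrative assumptions, not tuned settings.

# Hypothetical "overfit on purpose" settings, reusing the Section 1 script:
# more passes over the same 1,500 samples and no LoRA dropout make the model
# memorize the training answers more strongly.
from transformers import TrainingArguments
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,        # no dropout -> easier to memorize the training set
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="output_overfit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,     # many more passes than the original 3 epochs
    learning_rate=3e-4,
    fp16=True,
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    remove_unused_columns=False,
)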
3. Postscript
Evaluating large models is a complex, systematic task, especially in highly specialized fields such as finance and medicine. In real enterprise deployments, there are far more diverse methods for assessing the quality of model outputs.
This article approaches the topic from a beginner's perspective, so that learners can see the difference between a fine-tuned model and the original model in a simple, direct way.