Getting started from scratch: DeepSeek fine-tuning tutorial is here!

Written by
Caleb Hayes
Updated on: July 15, 2025
Recommendation

In-depth yet easy to understand: master DeepSeek fine-tuning with ease!
Core content:
1. A comparison of the model's output before and after fine-tuning
2. The concept of large-model fine-tuning, explained through analogies
3. Real-life examples to help you understand the fine-tuning process



Let’s get straight to the point and show you the effects before and after fine-tuning.
Before fine-tuning: (screenshot of the original model's answer)

After fine-tuning: (screenshot of the fine-tuned model's answer)

As you can see, the model's tone changes after fine-tuning. According to my records, the fine-tuned model also spends less time thinking before it answers.
Next, let’s work together to fine-tune the model and optimize its performance!

1. What is Large Model Fine-tuning?

Fine-tuning is like giving a "bookworm" extra lessons, turning them from a generalist into an expert in a particular field.
Here is the medical scenario whose data this article fine-tunes on. Suppose you have a very smart friend who has read books from all over the world (the equivalent of a large model's pre-training stage) and can chat with you about history, science, literature, and all sorts of other topics. But if you need him to help you interpret a medical report, he knows some basics yet may not be professional enough. So you give him a pile of medical books and case studies and have him study this material specifically (this is fine-tuning), and he becomes much more proficient with problems in the medical field.

Story explanation:

Imagine you have a robot that can draw a kitten (this is the pre-trained model). Now you want it to learn to draw a kitten wearing a hat. You don't need to teach it to draw from scratch; you just show it lots of pictures of "cats with hats" and say: "Keep your original drawing ability, but learn to add a hat!" That is fine-tuning!

Real-life examples:

Case 1: Smart speaker to adjust dialect
  • The basic speaker only speaks Mandarin (pre-trained model)

  • Let it listen to 100 Sichuan dialect sentences (fine-tuning data)

  • Now it can understand "Bai Long Men Zhen" (a Sichuan dialect expression for having a chat) (dialect comprehension ability↑)

Case 2: Camera filter principle
  • Original camera captures all scenes (universal model)

  • Load the "Food Filter" parameters (fine-tuned model)

  • Automatically increase saturation when taking photos of food (professional enhancement)

Enhanced explanation: Lego castle transformed into children's hospital

Step 1: Original Structure - Generic Lego Castle

[Universal Castle] 
▸ Metaphor: It's like the "standard castle building block set" purchased online, which has walls, towers, and spires and can be used as an ordinary house. 
▸ Corresponding technology: Pre-trained models (such as ChatGPT) have learned general language skills, but are not professional enough.

Step 2: Partial modification - low-cost modification

① Remove the spire → replace it with a dome
[Spiky roof to dome] 
▸ Operation: Replace the pointed blocks on top of the tower with rounded blocks, giving it a gentler, cuter look.
▸ Technical meaning: Fine-tune the model's top-level parameters (for example, modifying the classification head) so the output style better suits conversations with children.
② Install a revolving door
[Revolving door] 
▸ Operation: Insert a rotatable block module into the doorway without damaging the original door structure.
▸ Technical meaning: Insert an adapter module so the model gains the ability to understand pediatric medical terminology without interfering with its original knowledge.
③ Paint the hospital logo
[Hospital logo] 
▸ Operation: Stick "cross" symbols and cartoon-animal stickers onto the castle's outer wall.
▸ Technical meaning: Feature shift adjusts the model's internal representations so it pays more attention to medical vocabulary and childlike expressions.

Step 3: New function - Transform into a children's hospital

[Children's Hospital] 
▸ Result: The modified castle can now take in little patients, with a toy area, a gentle doctor (the dome), and special medical equipment (the revolving door).
▸ Technical significance: Through this lightweight transformation, the general-purpose model becomes a "pediatric medical question-and-answer robot" specializing in children's health consultation.

2. Hardware configuration currently tried

Graphics card: NVIDIA GeForce RTX 4060
CPU: Intel Core i7-13700H
Memory: 16 GB (usage on my home machine was about 8.8 / 15.7 GB)

3. Fine-tuning work

(1) Dataset preparation

The source of the dataset for this article is medical-o1-reasoning-SFT from the MoDa community.
The dataset format used in this article looks like the example below.
When fine-tuning a DeepSeek distilled model, introducing Complex_CoT (a complex chain of thought) into the dataset is a key design difference. If only basic question-answer pairs are used for training, the model struggles to fully learn deep reasoning, and its final performance falls significantly short of expectations. This is a fundamental difference from the data requirements of conventional large-model fine-tuning.
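For reference, a single record in this format might look like the following sketch. The values are just the toy strings used in the examples later in this article; real medical-o1-reasoning-SFT entries are much longer.

# One illustrative record (constructed from the field names used in the training code, not taken from the real dataset)
example = {
    "Question": "What should I do if I have a fever?",
    "Complex_CoT": "It may be caused by a cold.",
    "Response": "Drink more water and rest."
}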

(2) Model fine-tuning code (written by hand, without a training framework) - the full code is listed first, followed by a detailed explanation

Required libraries: pip install torch transformers peft datasets matplotlib accelerate safetensors
import torch
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TrainerCallback
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import os

# Configure the paths (modify according to your actual paths)
model_path = r"your model path"       # model path
data_path = r"your dataset path"      # dataset path
output_path = r"Your model save path after fine-tuning"  # where the fine-tuned model is saved

# Force use of GPU
assert torch.cuda.is_available(), "Must use GPU for training!"
device = torch.device("cuda")

# Custom callback to record the loss
class LossCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if "loss" in logs:
            self.losses.append(logs["loss"])

# Data preprocessing function
def process_data(tokenizer):
    dataset = load_dataset("json", data_files=data_path, split="train[:1500]")

    def format_example(example):
        instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
        inputs = tokenizer(
            f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",
            padding="max_length",
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        return {
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

    return dataset.map(format_example, remove_columns=dataset.column_names)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training parameter configuration
training_args = TrainingArguments(
    output_dir=output_path,
    per_device_train_batch_size=2,   # memory-friendly batch size
    gradient_accumulation_steps=4,   # accumulated gradients, equivalent to batch_size=8
    num_train_epochs=3,
    learning_rate=3e-4,
    fp16=True,                       # enable mixed precision
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    no_cuda=False,                   # force the use of CUDA
    dataloader_pin_memory=False,     # disable pinned (page-locked) memory
    remove_unused_columns=False      # prevent removal of unused columns
)

def main():
    # Create the output directory
    os.makedirs(output_path, exist_ok=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # Load the model onto the GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map={"": device}  # force a specific GPU
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # Prepare the data
    dataset = process_data(tokenizer)

    # Training callback
    loss_callback = LossCallback()

    # Data collator
    def data_collator(data):
        batch = {
            "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
            "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
            "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # use input_ids as labels
        }
        return batch

    # Create the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
        callbacks=[loss_callback]
    )

    # Start training
    print("Start training...")
    trainer.train()

    # Save the final model
    trainer.model.save_pretrained(output_path)
    print(f"The model has been saved to: {output_path}")

    # Plot the training-set loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(loss_callback.losses)
    plt.title("Training Loss Curve")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.savefig(os.path.join(output_path, "loss_curve.png"))
    print("Loss curve has been saved")

if __name__ == "__main__":
    main()
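The article does not show its inference code, but the following is one possible sketch for loading the saved LoRA adapter and reproducing a before/after comparison. It assumes the same model_path and output_path placeholders as above, and the prompt reuses the fever example that appears later in this article.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model_path = r"your model path"                            # base model (same placeholder as above)
output_path = r"Your model save path after fine-tuning"    # LoRA adapter saved by the Trainer

tokenizer = AutoTokenizer.from_pretrained(model_path)
base_model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, output_path)  # attach the fine-tuned adapter

prompt = "Diagnose the problem: What should I do if I have a fever?\nDetailed analysis:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))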

(3) Detailed explanation of the code

1. Import necessary libraries and modules

Function summary : Import third-party libraries that the project depends on, including PyTorch basic library, HuggingFace tool library, visualization library, etc.
import torch
import matplotlib.pyplot as plt
from transformers import (          # HuggingFace Transformers model tools
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    TrainerCallback
)
from peft import LoraConfig, get_peft_model  # parameter-efficient fine-tuning library
from datasets import load_dataset            # dataset loading tool
import os                                    # system path operations

About the libraries:

1. torch (core module of PyTorch library)

  • Function : Deep learning framework that provides tensor calculation and neural network construction functions.

  • Functions in the code :

    • Manage GPU devices ( torch.cuda.is_available() checks GPU availability)

    • Define tensor operations for model training

    • Controlling mixed precision training ( torch.float16 )


2. matplotlib.pyplot (Matplotlib drawing library)

  • Function : Data visualization tool library.

  • Functions in the code :

    • Plot the training loss curve ( plt.plot(losses) )

    • Generate and save the loss change graph of the training process ( loss_curve.png )


3. transformers (HuggingFace Transformers library)

  • Core components :

    • AutoTokenizer : Automatically load the tokenizer corresponding to the pre-trained model

      • Used to convert text into a sequence of token IDs that the model can understand

    • AutoModelForCausalLM : Automatically load causal language models (such as the GPT family)

      • Provides a basic large language model structure

    • TrainingArguments : Defines training hyperparameters

      • Control batch size, learning rate, logging frequency, etc.

    • Trainer : Class that encapsulates the training process

      • Automatically handles training loops, gradient descent, logging, etc.

    • TrainerCallback : training callback base class

      • Used to implement custom training monitoring logic (such as loss logging in the example)


4. peft (Parameter-Efficient Fine-Tuning)

  • Function : A library that implements efficient parameter fine-tuning methods.

  • Core components :

    • LoraConfig : Configuration class for LoRA (Low-Rank Adaptation)

      • Define key parameters such as rank ( r ), target modules ( target_modules )

    • get_peft_model : Convert the base model to a PEFT model

      • Only about 0.1% of the original model parameters need to be trained to achieve effective fine-tuning

  • Functions in the code :

    • Lightweight fine-tuning of large models such as LLaMA

    • The memory usage is reduced by about 60-70%, which is suitable for consumer-grade GPUs


5. datasets (HuggingFace Datasets library)

  • Function : Efficient dataset loading and processing tool.

  • Core methods :

    • load_dataset : Load data in multiple formats

      • Supports JSON/CSV/Parquet and other formats (JSON is used in the example)

    • map : Data preprocessing pipeline

      • Apply a custom formatting function ( format_example )

  • Functions in the code :

    • Load the medical question answering dataset from a local file

    • Convert the raw data into the input format required by the model


6. os (operating system interface)

  • Function : Provide operating system related functions.

  • Functions in the code :

    • Create output directory ( os.makedirs )

    • Handling file path related operations

    • Ensure the validity of the model save path

2. Configuration path and hardware check

Function summary : configure model/data path, force check GPU availability
# Configure the paths (modify according to your actual paths)
model_path = r"your model path"        # pre-trained model storage path
data_path = r"your dataset path"       # training data path (JSON format)
output_path = r"Your model path after fine-tuning"  # where the fine-tuned model is saved

# Force use of GPU (make sure CUDA is available)
assert torch.cuda.is_available(), "Must use GPU for training!"
device = torch.device("cuda")  # specify the CUDA device

3. Custom training callback class

Function summary : Implement a custom callback that records the loss value in real time during model training. The loss value measures the gap between the model's predictions and the true answers; the smaller the loss, the better the model performs.
class LossCallback(TrainerCallback):
    def __init__(self):
        self.losses = []  # list for storing loss values

    # Triggered whenever a log is emitted during training
    def on_log(self, args, state, control, logs=None, **kwargs):
        if "loss" in logs:  # filter and record the loss value
            self.losses.append(logs["loss"])

4. Data preprocessing function

Function summary : Load and format training data, converting the original data set into a format that the model can understand .
def process_data(tokenizer):
    # Load the dataset from the JSON file (only the first 1500 records)
    dataset = load_dataset("json", data_files=data_path, split="train[:1500]")

    # Format a single example
    def format_example(example):
        # Concatenate the instruction and the answer (fixed template)
        instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"
        inputs = tokenizer(
            f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",  # add the end-of-text token
            padding="max_length",   # pad to the maximum length
            truncation=True,        # truncate if too long
            max_length=512,         # maximum sequence length
            return_tensors="pt"     # return PyTorch tensors
        )
        # Return the processed inputs (drop the batch dimension)
        return {
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0)
        }

    # Apply the formatting function and remove the original columns
    return dataset.map(format_example, remove_columns=dataset.column_names)

Key Code

1. Combine instructions and answers

  • Function : Combine the question ( Question ) and the detailed analysis ( Complex_CoT ) into one instruction.

  • Example :

    • Input: Question="What should I do if I have a fever?" , Complex_CoT="It may be caused by a cold."

    • Output: "Diagnostic question: What should I do if I have a fever?\nDetailed analysis: It may be caused by a cold."

  • Analogy : It’s like writing the problem and analysis on a single piece of paper.

instruction = f"Diagnose the problem: {example['Question']}\nDetailed analysis: {example['Complex_CoT']}"

2. Use a tokenizer to process text

  • Function : Convert the concatenated text into a format that the model can understand.

  • Parameter Description :

    • padding="max_length" : pads the text to a fixed length (512).

    • truncation=True : If the text exceeds 512 tokens, it will be truncated.

    • max_length=512 : The maximum length is 512.

    • return_tensors="pt" : Returns PyTorch tensors.

  • Example :

    • Input: "Diagnostic question: What should I do if I have a fever? \nDetailed analysis: It may be caused by a cold. \n### Answer: \nDrink more water and rest. "

    • Output: input_ids=[101, 234, 345, ..., 102] , attention_mask=[1, 1, 1, ..., 1]

  • Analogy : It's like translating text into numbers that a machine can understand.

inputs = tokenizer(
    f"{instruction}\n### Answer:\n{example['Response']}<|endoftext|>",  # add the end-of-text token
    padding="max_length",   # pad to the maximum length
    truncation=True,        # truncate if too long
    max_length=512,         # maximum sequence length
    return_tensors="pt"     # return PyTorch tensors
)

3. Return the processed input

  • Purpose : Returns the processed input data and removes extra dimensions.

  • Parameter Description :

    • input_ids : The token ID sequence corresponding to the text.

    • attention_mask : marks which positions are valid tokens (1 for valid, 0 for padding).

  • Analogy : It's like arranging the translated numbers into a table.

return {"input_ids": inputs["input_ids"].squeeze(0), "attention_mask": inputs["attention_mask"].squeeze(0)}

4. Apply formatting function

  • Effect : Applies a formatting function to the entire dataset and removes the original column.

  • Parameter Description :

    • format_example : Formatting function.

    • remove_columns=dataset.column_names : Remove original columns (such as Question , Complex_CoT , etc.).

  • Analogy : It's like translating every page of an entire book into a format that a machine can understand.

return dataset.map(format_example, remove_columns=dataset.column_names)

5. LoRA fine-tuning configuration

Function summary : Configure LoRA parameters and specify the model module to be adapted.
peft_config = LoraConfig(
    r=16,                                 # LoRA rank (dimension of the low-rank decomposition)
    lora_alpha=32,                        # scaling factor (controls the adapter's influence strength)
    target_modules=["q_proj", "v_proj"],  # attention modules to adapt (query/value projections)
    lora_dropout=0.05,                    # dropout rate to prevent overfitting
    bias="none",                          # do not train bias parameters
    task_type="CAUSAL_LM"                 # task type: causal language model
)

1. r=16 : rank of LoRA

  • Function : Control the dimension of low-rank matrix. The smaller the rank, the fewer parameters and the smaller the amount of calculation.

  • explain :

    • The rank ( r ) is the decomposition dimension of the low-rank matrix and determines the size of the low-rank matrix.

    • For example, r=16 means the dimension of the low-rank matrices is 16.

  • Influence :

    • A smaller r will reduce the number of parameters but may reduce the performance of the model.

    • A larger r will increase the number of parameters but may improve the performance of the model.

  • simile:

"It is equivalent to setting a 16-page limit on the length of AI's 'study notes'"
→ Few pages (r small): Learn quickly but may miss details
→ Many pages (large r): Learn in detail but slowly
  • Default value : usually set to 8 or 16; bigger is not automatically better. The choice of LoRA rank has to balance the model's adaptability against computational efficiency. A larger rank provides stronger expressive power, but increases computation and memory usage and may lead to overfitting. For simple tasks a smaller rank (such as 4 or 8) is usually recommended, while complex tasks may need a higher rank (such as 16 or 32). A rough parameter-count sketch follows just below.
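To make this trade-off concrete, here is a small sketch (not part of the original article) that estimates how many extra trainable parameters LoRA adds for one adapted weight matrix. The hidden size of 2048 and the 24-layer count are illustrative assumptions, not the actual DeepSeek model dimensions.

# LoRA adds two low-rank matrices per adapted weight matrix:
# A with shape (r, d_in) and B with shape (d_out, r),
# so the extra trainable parameters are r * (d_in + d_out).
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

hidden = 2048   # illustrative hidden size (assumption)
layers = 24     # illustrative number of Transformer layers (assumption)
for r in (4, 8, 16, 32):
    per_matrix = lora_extra_params(hidden, hidden, r)
    total = per_matrix * 2 * layers  # q_proj and v_proj in every layer
    print(f"r={r:>2}: {per_matrix:,} params per matrix, about {total:,} in total")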


2. lora_alpha=32 : scaling factor

  • Function : Control the influence of the low-rank matrix on the original model.

  • explain :

    • lora_alpha is a scaling factor used to adjust the output of the low-rank matrix.

    • Specifically, the output of the low-rank matrices is multiplied by lora_alpha / r (see the sketch after this list).

  • Influence :

    • A larger lora_alpha will make the influence of low-rank matrices stronger.

    • A smaller lora_alpha will make the influence of low-rank matrices weaker.

  • simile:

Like a volume knob: the knob's position decides how loud the sound is. Turn it too high and the sound becomes deafening or even unbearable; turn it too low and the sound is too faint to hear clearly.
A lora_alpha that is too large can make training unstable, just as an over-loud sound is uncomfortable, and it can cause overfitting because the model becomes too sensitive to details of the training data.
A lora_alpha that is too small makes the model adjust its weights more conservatively during training, which keeps training stable but may slow adaptation to the new task.
  • Default : Typically set to 32.
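To show where this scaling enters the computation, here is a minimal sketch with a tiny made-up layer (the sizes are arbitrary assumptions); the forward pass adds the low-rank update, scaled by lora_alpha / r, to the frozen weight's output:

import torch

d_in, d_out, r, lora_alpha = 8, 8, 4, 32   # tiny illustrative sizes (assumptions)
W = torch.randn(d_out, d_in)               # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01            # LoRA "down" matrix (trainable)
B = torch.zeros(d_out, r)                  # LoRA "up" matrix (trainable, starts at zero)

def lora_forward(x):
    # original output + low-rank update, scaled by lora_alpha / r
    return x @ W.T + (lora_alpha / r) * (x @ A.T @ B.T)

x = torch.randn(2, d_in)
print(lora_forward(x).shape)  # torch.Size([2, 8])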


3. target_modules=["q_proj", "v_proj"] : target modules

  • Function : Specifies the model module into which the low-rank matrix needs to be inserted.

  • explain :

    • q_proj  and  v_proj  are the attention mechanism modules in the Transformer model:

      • q_proj : Query projection matrix.

      • v_proj : Value projection matrix.

    • LoRA inserts low-rank matrices in these two modules.

  • Influence :

    • Choosing different modules affects the fine-tuning results (a sketch for listing the candidate module names follows this list).

    • q_proj and v_proj are usually chosen because they have a greater impact on the performance of the model.
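If you are not sure which module names your particular model exposes for target_modules, one quick check is to list the submodules whose names end with the usual projection suffixes. This is a small sketch assuming model_path is the same placeholder used in the main script:

from transformers import AutoModelForCausalLM

model_path = r"your model path"  # same placeholder as in the main script
model = AutoModelForCausalLM.from_pretrained(model_path)

# Print the attention projection layers that LoRA could target
for name, module in model.named_modules():
    if name.endswith(("q_proj", "k_proj", "v_proj", "o_proj")):
        print(name)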


4. lora_dropout=0.05 : Dropout rate

  • Function : Prevent overfitting.

  • explain :

    • Dropout is a regularization technique that randomly discards some neurons to prevent the model from over-relying on certain features.

    • lora_dropout=0.05 means that during the training process, 5% of the low-rank matrix parameters will be randomly discarded.

  • Influence :

    • A larger Dropout rate will increase the robustness of the model but may reduce training efficiency.

    • A smaller dropout rate will reduce the regularization effect but may increase training speed.


5. bias="none" : bias parameter

  • Function : Controls whether to train the bias parameter. The function of the bias parameter is to provide a baseline offset for the output of the model so that the model can better fit the data.

  • explain :

    • bias="none" means no bias parameter training.

    • Other options include "all" (train all bias parameters) and "lora_only" (train only LoRA-related bias parameters).

  • Influence :

    • Not training bias parameters can reduce the number of parameters, but may affect the performance of the model.


6. task_type="CAUSAL_LM" : Task type

  • Purpose : Specify the task type.

  • explain :

    • CAUSAL_LM stands for Causal Language Model, which is a generative task (such as GPT).

    • Other task types include sequence classification ( SEQ_CLS ), sequence to sequence ( SEQ_2_SEQ ), etc.

  • Influence :

    • Different mission types will affect how LoRA is implemented.

6. Training parameter configuration

Function summary : Set training hyperparameters and hardware-related options.
training_args = TrainingArguments(
    output_dir=output_path,            # output directory (model/logs)
    per_device_train_batch_size=2,     # per-GPU batch size (memory optimization)
    gradient_accumulation_steps=4,     # gradient accumulation steps (equivalent to batch_size=8)
    num_train_epochs=3,                # number of training epochs
    learning_rate=3e-4,                # initial learning rate
    fp16=True,                         # enable mixed-precision training (saves video memory)
    logging_steps=20,                  # log every 20 steps
    save_strategy="no",                # do not save intermediate checkpoints
    report_to="none",                  # disable third-party reporting (e.g. W&B)
    optim="adamw_torch",               # optimizer type
    no_cuda=False,                     # force the use of CUDA
    dataloader_pin_memory=False,       # disable pinned (page-locked) memory
    remove_unused_columns=False        # keep unused columns (to avoid data errors)
)

1. output_dir=output_path : output directory

  • Function : Specify the save path of the model and logs during training. The output_path here has been written in the first variable.

  • explain :

    • The model checkpoints, log files, etc. generated during the training process will be saved in this directory.

  • Example :

    • If output_path = "./output" , all files will be saved in the ./output directory.


2. per_device_train_batch_size=2 : Single GPU batch size

  • Purpose : Set the training batch size on each GPU.

  • explain :

    • Batch size refers to the number of samples fed into the model at a time.

    • Smaller batch sizes save video memory but may slow down training.

  • Example :

    • If 1 GPU is used, 2 data will be input for each training.


3. gradient_accumulation_steps=4 : Gradient accumulation steps

  • Function : Accumulate gradients over several small batches before each parameter update.

  • explain :

    • With per_device_train_batch_size=2 and gradient_accumulation_steps=4 , the gradients of 4 mini-batches are summed before the optimizer step, which is equivalent to a batch size of 8 while only keeping 2 samples in memory at a time.

  • Example :

    • If video memory cannot hold a batch of 8, keep the batch size at 2 and accumulate over 4 steps.


4. num_train_epochs=3 : training rounds

  • Function : Set the number of rounds for the model to be trained on the entire dataset.

  • explain :

    • One epoch means that the model goes through the training dataset completely once.

    • Here it is set to 3, which means the model will be trained for 3 rounds.

  • Example :

    • If the data set has 1000 pieces of data, the model will traverse these 1000 pieces of data 3 times.


5. learning_rate=3e-4 : initial learning rate

  • Function : Set the initial step size the optimizer uses when updating the model weights.

  • explain :

    • 3e-4 (0.0003) is a commonly used starting point for LoRA fine-tuning.

    • A learning rate that is too large can make training unstable; one that is too small makes convergence slow.

  • Example :

    • If the loss curve oscillates instead of decreasing, try lowering the learning rate (for example, to 1e-4 ).


6. fp16=True : Mixed Precision Training

  • Purpose : Enable mixed precision training, save video memory and speed up training.

  • explain :

    • Mixed precision training refers to using both 16-bit (half-precision) and 32-bit (single-precision) floating point numbers.

    • 16-bit floating point numbers take up less video memory and are faster to calculate.

  • Example :

    • If video memory is insufficient, enabling fp16 can significantly reduce video memory usage.


7. logging_steps=20 : Logging frequency

  • Function : Set the number of steps to record a log.

  • explain :

    • The log includes information such as loss value and learning rate.

    • Here it is set to 20, which means that a log is recorded every 20 steps.

  • Example :

    • If the total number of training steps is 1000, 50 logs will be recorded ( 1000 / 20 = 50 ).


8. save_strategy="no" : Save strategy

  • Function : Set whether to save intermediate checkpoints.

  • explain :

    • "no" means do not save intermediate checkpoints.

    • Other options include "epoch" (save once per round) and "steps" (save every certain number of steps).

  • Example :

    • If set to "epoch" , the model will be saved after each round of training.


9. report_to="none" : Disable third-party reporting

  • Purpose : Disable third-party log reporting tools (such as Weights & Biases).

  • explain :

    • If you do not need to use third-party tools to record logs, you can set it to "none" .

  • Example :

    • If set to "wandb" , logs will be synchronized to the Weights & Biases platform.


10. optim="adamw_torch" : optimizer type

  • Function : Specify the optimizer type.

  • explain :

    • adamw_torch is a commonly used optimizer that combines Adam and Weight Decay.

    • Suitable for most deep learning tasks.

  • Example :

    • If training is unstable, you can try other optimizers such as sgd [Stochastic Gradient Descent]. SGD is an algorithm for optimizing model parameters by calculating the gradient of the loss function and updating the parameters to minimize the loss function.


11. no_cuda=False : Force the use of CUDA

  • Purpose : Force the use of GPU for training.

  • explain :

    • no_cuda=False means using GPU.

    • If set to True , the CPU will be used (not recommended).

  • Example :

    • If a GPU is available, the model will automatically be trained using the GPU.


12. dataloader_pin_memory=False : Disable page-locked memory

  • Function : Set whether to use pinned memory to speed up data loading.

  • explain :

    • Page-locked memory can increase data loading speed, but will take up more host memory.

    • Setting this to False disables page-locked memory.

  • Example :

    • If the host has sufficient memory, this can be set to True to speed up training.


13. remove_unused_columns=False : Keep unused columns

  • Function : Set whether to remove unused columns in the dataset.

  • explain :

    • If set to True , columns in the dataset that are not used by the model will be removed.

    • If set to False , all columns will be retained.

  • Example :

    • If your dataset contains some extra information (like IDs), you can keep those columns.

7. Main function (training process)

Function summary : Integrate all components and execute the complete training process.
def main():
    # Create the output directory if it does not exist
    os.makedirs(output_path, exist_ok=True)

    # Load the tokenizer and set the padding token
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token  # use EOS as padding

    # Load the pre-trained model (half precision + specified GPU)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,   # half-precision loading (saves video memory)
        device_map={"": device}      # specify the GPU device to use
    )
    # Apply the LoRA adapter
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # print the number of trainable parameters

    # Prepare the training dataset
    dataset = process_data(tokenizer)

    # Initialize the loss-recording callback
    loss_callback = LossCallback()

    # Data collation function (builds batches)
    def data_collator(data):
        batch = {
            "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
            "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
            "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # labels = inputs (causal LM task)
        }
        return batch

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,   # custom data collation
        callbacks=[loss_callback]      # add the callback
    )

    # Run training
    print("Start training...")
    trainer.train()

    # Save the fine-tuned model
    trainer.model.save_pretrained(output_path)
    print(f"The model has been saved to: {output_path}")

    # Plot the loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(loss_callback.losses)
    plt.title("Training Loss Curve")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.savefig(os.path.join(output_path, "loss_curve.png"))  # save as PNG
    print("Loss curve has been saved")

if __name__ == "__main__":
    main()

Key code:

1. Load the tokenizer and set the padding token
  • Function : Load the pre-trained model's tokenizer and set its padding token.

  • explain :

    • AutoTokenizer.from_pretrained : Automatically load a tokenizer that matches the model.

    • tokenizer.pad_token = tokenizer.eos_token: use the end-of-sequence token (EOS) as the padding token (pad token).

  • Example :

    • If the input sequence is not long enough, it will be padded with EOS.

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # use EOS as the padding token
2. Load the pre-trained model
  • Purpose : Load the pre-trained language model and configure hardware-related settings.

  • explain :

    • AutoModelForCausalLM.from_pretrained : loads a causal language model (such as GPT).

    • torch_dtype=torch.float16 : Use half-precision (16-bit floating point numbers) to load the model and save video memory.

    • device_map={"": device} : Load the model onto the specified GPU device.

  • Example :

    • If device = "cuda:0" , the model will be loaded onto the first GPU.

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # half-precision loading (saves video memory)
    device_map={"": device}      # specify the GPU device to use
)
3. Data collation function
  • Function : Collate multiple samples into one batch.

  • explain :

    • input_ids : The token ID of the input sequence.

    • attention_mask : marks the location of valid tokens.

    • labels : The labels for the causal language model are the same as the input (the model needs to predict the next token).

  • Example :

    • If the input is ["Diagnostic problem: What to do if you have a fever?", "Diagnostic problem: What to do if you have a headache?"] , it will be sorted into one batch.

def data_collator(data):
    batch = {
        "input_ids": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device),
        "attention_mask": torch.stack([torch.tensor(d["attention_mask"]) for d in data]).to(device),
        "labels": torch.stack([torch.tensor(d["input_ids"]) for d in data]).to(device)  # labels = inputs (causal LM task)
    }
    return batch
4. Initialize the Trainer
  • Purpose : Create a trainer object and manage the training process.

  • explain :

    • model : The model to be trained.

    • args : training parameters (such as batch size, learning rate, etc.).

    • train_dataset : training dataset.

    • data_collator : custom data collating function.

    • callbacks : training callbacks (e.g. loss logging).

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,   # custom data collation
    callbacks=[loss_callback]      # add the callback
)

4. Closing remarks

Many thanks to the official DeepSeek site for its valuable help with the code revision, data collection, and polishing of this chapter!
The fine-tuning in this chapter is still fairly basic, so the loss does not converge as well as it could, and there is plenty of room for optimization: the dataset construction could be more refined, and the code structure could be further cleaned up and tuned. Your suggestions and corrections are very welcome, so that we can improve together and discover more of the fun on the road of learning AI!