Ten Methods for Fine-tuning Large Models

Written by Jasper Cole
Updated on: July 12, 2025
Recommendation

Explore a range of strategies for fine-tuning large AI models and find the approach that fits your needs.

Core content:
1. Practical approaches to full-parameter fine-tuning and where to apply them
2. How partial-parameter fine-tuning and adapter fine-tuning compare
3. Newer fine-tuning ideas and the practical details of using them

Yang Fangxian, Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)
In the field of artificial intelligence, large models have become the core tool for solving complex tasks. However, a pre-trained model used as-is often cannot meet the needs of a specific task, so it must be fine-tuned. The essence of fine-tuning is to let the model build on its general knowledge by learning from task-specific data, thereby improving performance on that task.
Fine-tuning is not a one-size-fits-all operation, though. Different task requirements, data volumes, and resource constraints call for different fine-tuning methods. This article introduces 10 fine-tuning methods for large models in detail, from full-parameter fine-tuning to adapter fine-tuning, from prompt tuning to knowledge distillation. Each method has its own advantages and applicable scenarios, and whether you are a beginner or an experienced developer, you should find a strategy here that suits you.
1. Full Fine-Tuning 
  • In layman's terms: you have a "brain" (a pre-trained model) that has learned a lot of general knowledge, and now you want it to learn a new task (such as sentiment analysis or machine translation), so you re-adjust all of its knowledge from beginning to end.
  • Specific methods:
    Load a pre-trained model (such as BERT or GPT). Prepare data for the new task (such as a labeled sentiment analysis dataset). Retrain all parameters of the model on the new data, usually with a small learning rate (such as 1e-5 to 1e-4) to avoid destroying the pre-trained knowledge. Evaluate the model on a validation set and adjust hyperparameters (such as learning rate and batch size) as needed. See the sketch after this list.
  • Advantages: usually gives the strongest results, especially when the new task differs substantially from pre-training.
  • Disadvantages: high computational cost; requires a lot of data and compute.
  • Applicable scenarios: the task is complex and very different from the pre-training task (for example, moving from general text understanding to medical text analysis).
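A minimal sketch of full fine-tuning with Hugging Face Transformers, assuming a DataLoader named train_loader that yields tokenized batches containing a "labels" field; the model name, learning rate, and loop are illustrative rather than a prescribed recipe:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"                      # illustrative backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)  # used when building train_loader (not shown)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# All parameters stay trainable; a small learning rate (2e-5 here) helps avoid
# overwriting the pre-trained knowledge.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in train_loader:        # assumed: batches of tokenized, labeled examples
    outputs = model(**batch)      # returns a loss because "labels" is present
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```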
2. Partial Fine-Tuning
  • In layman's terms: you don't retrain the entire "brain"; you only adjust the last few layers (such as the classification layer) and keep the earlier knowledge unchanged.
  • Specific methods:
    Load the pre-trained model. Freeze the early layers (e.g. the first 10 layers of BERT) and unfreeze only the last few layers (e.g. the classification head). Train the unfrozen parts on data from the new task. Evaluate on the validation set and adjust the number of unfrozen layers if necessary. See the sketch after this list.
  • Advantages: saves computing resources and trains quickly.
  • Disadvantages: if the new task is too different from the pre-training task, results may suffer.
  • Applicable scenarios: the new task is similar to the pre-training task (such as text classification).
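A minimal sketch of freezing the early layers, assuming the bert-base-uncased checkpoint (12 encoder layers); only layers 10-11 and the classification head stay trainable:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze the embeddings and encoder layers 0-9.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:10]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the remaining parameters will receive gradients during training.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors remain")
```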
3. Adapter Fine-Tuning
  • In layman's terms: you don't change the original "brain" at all; instead you add small plug-ins (adapters) to it. The plug-ins handle the new task while the original knowledge stays completely unchanged.
  • Specific methods:
    Insert a small neural network module (an adapter), usually a two-layer feed-forward bottleneck, into each layer of the model. Freeze the parameters of the pre-trained model and train only the adapter modules. During training, the adapters learn how to adapt the model's intermediate representations to the new task. See the sketch after this list.
  • Advantages: very resource-efficient and well suited to multi-task learning.
  • Disadvantages: the added modules have limited capacity, so results may fall short of full-parameter fine-tuning.
  • Applicable scenarios: resources are limited and multiple tasks must be handled at once (such as multilingual translation).
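A minimal sketch of the adapter idea in plain PyTorch: a bottleneck module (down-project, nonlinearity, up-project, residual) that could be inserted after each frozen transformer layer. The sizes (hidden_size=768, bottleneck=64) are illustrative:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project to a small dimension
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up

    def forward(self, hidden_states):
        # The residual keeps the frozen layer's output intact; the adapter only
        # learns a small task-specific correction on top of it.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Training (sketch): freeze the backbone, optimize only the adapter parameters.
# for p in backbone.parameters():
#     p.requires_grad = False
# optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```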
4. Prompt Tuning
  • In layman's terms: you don't retrain the "brain" at all; you guide it to give the answer you want by changing how you ask the question (the prompt). For example, "Is this movie good?" and "How good is this movie?" may get different answers.
  • Specific methods:
    Design a prompt template, such as "The sentiment of this movie review is: [MASK]". Combine the prompt with the input data and feed it to the pre-trained model. Let the model fill in the blank (the [MASK] position) and read the sentiment off the filled-in word. Performance can be improved by refining the prompt design. See the sketch after this list.
  • Advantages: no model training needed; simple and direct.
  • Disadvantages: designing good prompts takes skill, and results can be unstable.
  • Applicable scenarios: few-shot or zero-shot learning (very little data).
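A minimal sketch of the cloze-style prompting described above, using a masked language model through the Transformers fill-mask pipeline; no weights are updated, and the template and the candidate label words ("good"/"bad") are illustrative choices:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting was flat."
prompt = f"{review} Overall, the movie was [MASK]."

# Restrict the prediction to two candidate label words and compare their scores.
for candidate in fill(prompt, targets=["good", "bad"]):
    print(candidate["token_str"], round(candidate["score"], 4))
```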
5. Prefix Tuning
  • In simple terms: you add a trained "guide" (a prefix) before the input. This prefix is learned specifically to tell the model how to answer.
  • Specific methods:
    Prepend trainable prefix vectors to the input (in practice, to each layer's attention keys and values). Freeze the parameters of the pre-trained model and train only the prefix vectors. The learned prefix steers the model toward output that meets the task requirements. See the sketch after this list.
  • Advantages: more flexible and usually more effective than hand-designed prompts.
  • Disadvantages: the prefix must be trained, which adds some complexity.
  • Applicable scenarios: generation tasks (such as text generation).
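A minimal sketch of prefix tuning with the PEFT library, assuming GPT-2 as the backbone and 20 virtual prefix tokens (both choices are illustrative); only the prefix parameters are trained while GPT-2 stays frozen:

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base, config)

# Reports how few parameters (just the prefix vectors) are actually trainable.
model.print_trainable_parameters()
```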
6. Low-Rank Adaptation (LoRA)
  • In layman's terms: instead of adjusting the whole "brain", you change its behavior through small, efficient corrections (low-rank matrices), like tuning a machine with a small tool instead of disassembling it.
  • Specific methods:
    Add a pair of low-rank matrices alongside the model's weight matrices, so each weight update takes the form of a low-rank correction (the product of two narrow matrices). Freeze the original weights and train only the low-rank matrices. Because the correction has low rank, the number of trainable parameters is drastically reduced. See the sketch after this list.
  • Advantages: saves computing resources and scales to very large models.
  • Disadvantages: understanding why low-rank updates work requires some linear-algebra background.
  • Applicable scenarios: fine-tuning large-scale models (such as GPT-3-class models).
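A minimal sketch of LoRA with the PEFT library; the rank (r=8) and the target module name ("c_attn", GPT-2's attention projection) are illustrative and vary by model:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # which weight matrices receive low-rank corrections
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```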
7. Knowledge Distillation
  • In layman's terms: a powerful "teacher model" teaches a smaller, faster "student model" how to do the task, and the student ends up performing nearly as well as the teacher.
  • Specific methods:
    Train (or take) a large teacher model. Use the teacher's output distribution (soft labels) as the training signal for a smaller student model. The student imitates the teacher by matching its output distribution, often combined with the ordinary hard-label loss. See the sketch after this list.
  • Advantages: suited to resource-constrained settings; the student model is easier to deploy.
  • Disadvantages: the student may not fully match the teacher's quality.
  • Applicable scenarios: model compression or deployment on resource-constrained devices (such as mobile phones).
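A minimal sketch of one distillation training step, assuming Transformers-style teacher and student models and a tokenized batch are defined elsewhere; the temperature (T=2.0) and the mixing weight alpha are illustrative:

```python
import torch
import torch.nn.functional as F

T = 2.0  # temperature: softens both distributions so secondary class scores matter

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # Soft part: match the teacher's softened output distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

with torch.no_grad():                         # the teacher is frozen
    teacher_logits = teacher(**batch).logits
student_logits = student(**batch).logits
loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
```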
8. Continual Learning
  • In layman's terms: you keep learning new tasks without forgetting what you learned before, like learning to cook without forgetting how to drive.
  • Specific methods:
    Use regularization techniques (such as EWC, Elastic Weight Consolidation) to protect parameters that were important for earlier tasks. Use memory replay to periodically revisit data from old tasks. Or use model-expansion techniques that give each task its own dedicated parameters. See the sketch after this list.
  • Advantages: suited to scenarios where tasks arrive one after another.
  • Disadvantages: catastrophic forgetting is a constant risk, and avoiding it requires extra machinery.
  • Applicable scenarios: multi-task learning or dynamic task environments.
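A minimal sketch of the EWC penalty mentioned above, assuming fisher (per-parameter Fisher information) and old_params (a snapshot of the weights after the old task) were computed beforehand; the strength lam is illustrative:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    # Parameters that were important for the old task (high Fisher value) are
    # pulled back toward their old values; unimportant ones can move freely.
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty

# New-task training step (sketch): task loss plus the regularization term.
# loss = task_loss(model, batch) + ewc_penalty(model, fisher, old_params)
```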
9. Multi-Task Learning
  • In layman's terms: you learn several related tasks at the same time, like learning to cook while also learning to shop for groceries; the tasks reinforce each other and you learn faster.
  • Specific methods:
    Design a shared architecture in which the tasks share most parameters (such as the lower layers of BERT) and each task has its own output head (such as a classification layer). Train all tasks together, balancing their importance with a weighted loss. See the sketch after this list.
  • Advantages: works well when the tasks have a lot in common.
  • Disadvantages: if the tasks are too different, they may interfere with each other.
  • Applicable scenarios: strongly correlated tasks (such as text classification and sentiment analysis).
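A minimal sketch of a shared-encoder, multi-head setup: both tasks share the BERT encoder and each has its own classification head; the task names and label counts are illustrative:

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared layers
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            "topic": nn.Linear(hidden, 4),      # e.g. 4 topic classes
            "sentiment": nn.Linear(hidden, 2),  # e.g. positive / negative
        })

    def forward(self, task, **inputs):
        pooled = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
        return self.heads[task](pooled)

# Training (sketch): alternate or mix batches from both tasks and combine the
# per-task losses with weights, e.g. loss = w1 * loss_topic + w2 * loss_sentiment.
```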
10. Domain Adaptation
  • In layman's terms: you learned to cook Chinese food, and now you need to cook Western food. It's still cooking, but the ingredients and techniques differ, so some adjustment is needed.
  • Specific methods:
    Fine-tune the model on data from the target domain. Optionally use domain-adversarial training to reduce the gap between source and target domains, or attach domain-specific adapters. See the sketch after this list.
  • Advantages: handles tasks with large domain gaps.
  • Disadvantages: requires at least some target-domain data.
  • Applicable scenarios: cross-domain tasks (such as moving from news text to medical text).
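A minimal sketch of the simplest option above: continuing to train the model on unlabeled target-domain text with the masked-language-model objective before task fine-tuning. Here medical_texts is an assumed list of raw strings from the target domain:

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# Tokenize the raw target-domain sentences; the collator pads them and masks
# 15% of the tokens, producing inputs and labels for the MLM objective.
examples = [{"input_ids": ids}
            for ids in tokenizer(medical_texts, truncation=True)["input_ids"]]
batch = collator(examples)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch).loss  # one adaptation step; loop over many batches in practice
loss.backward()
optimizer.step()
```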