Recommendation
An in-depth analysis of large-model fine-tuning techniques, to support the efficient execution of machine learning and deep learning tasks.
Core content:
1. A detailed explanation of fine-tuning and its advantages on large datasets
2. Parameter-Efficient Fine-Tuning (PEFT) technology and its core methods
3. P-Tuning technology principles and practical application case analysis
Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)
There are many ways to fine-tune large models. Large-model fine-tuning is a technique that adjusts the parameters of a pre-trained model to adapt it to a specific task, and it is widely used in machine learning and deep learning.

1. Fine-Tuning
Fine-tuning means retraining the parameters of the entire network, or of selected layers, for the target task while retaining most of the pre-trained weights. This method is usually suitable when the target dataset is large and representative.
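A minimal sketch of this with Hugging Face Transformers follows; the model name, the choice of which layers to freeze, and the learning rate are illustrative assumptions, not prescriptions from the text:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load a pre-trained model; all weights are trainable by default.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Optionally retrain only some layers: freeze the embeddings and the
# lower encoder layers, leaving the top layers and classifier trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:  # freeze 8 of 12 encoder layers
    for param in layer.parameters():
        param.requires_grad = False

# Optimize only the parameters that remain trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```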
2. Parameter-Efficient Fine-Tuning (PEFT)
Parameter-efficient fine-tuning methods aim to reduce the number of trainable parameters, thereby lowering computational cost and improving efficiency. Common techniques include LoRA (Low-Rank Adaptation), Prefix-Tuning, and P-Tuning.
The core idea of LoRA is to keep the original weight matrix frozen and introduce two small matrices from a low-rank decomposition as an incremental module to be optimized, achieving performance gains with only a small amount of additional storage.
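A minimal LoRA sketch in PyTorch, using a hypothetical `LoRALinear` wrapper (the rank `r` and scaling `alpha` values are illustrative):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weight matrix stays frozen
        # The two small matrices of the low-rank decomposition: A is random,
        # B starts at zero so the update is a no-op before training.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) / math.sqrt(r))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))  # only lora_A and lora_B receive gradients
```

Only `lora_A` and `lora_B` need to be stored per task, which is where the small additional storage cost comes from.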
Prefix-Tuning is a lightweight alternative to full fine-tuning for natural language generation tasks: it freezes the language model parameters and optimizes only a small, continuous task-specific vector (called the prefix). The method is inspired by prompting, allowing subsequent tokens to attend to this prefix as if it were a sequence of "virtual tokens". The original paper applies prefix-tuning to GPT-2 for table-to-text generation and to BART for text summarization.
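As a sketch, prefix-tuning can be applied with the Hugging Face `peft` library; the model and the number of virtual tokens below are illustrative, and the exact config fields may vary between `peft` releases:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

# A BART-style encoder-decoder model, as in the summarization setting.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,  # length of the learned continuous prefix
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prefix parameters are trainable
```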
BART (Bidirectional and Auto-Regressive Transformer) is a pre-trained text generation model proposed by Facebook AI Research (FAIR) in 2019. It combines the bidirectional encoding capability of BERT with the autoregressive decoding capability of GPT, making it suitable for tasks such as text infilling, text summarization, question-answering systems, and text generation.
BART uses the Transformer encoder-decoder architecture:
Encoder: similar to BERT, it reads the complete sequence and performs bidirectional modeling.
Decoder: similar to GPT, it predicts the next token step by step for autoregressive text generation.

P-Tuning is a prompt-tuning-style method for fine-tuning large language models (LLMs). Unlike traditional full-parameter fine-tuning, P-Tuning only inserts learnable "prompt embeddings" (also known as prompt tokens, prefixes, etc.) at the model's input layer or intermediate layers, thereby greatly reducing the number of fine-tuned parameters. Its core ideas can be summarized as follows:
Freeze most or all of the original model parameters.
Introduce a small number of trainable parameters (the prompt embeddings).
Update only these trainable parameters via gradient backpropagation.

During training, the model splices these prompt embeddings into the original input (or into the input of a hidden layer inside the model) so that the pre-trained model can better represent or generate for the task. Because only the prompt embeddings are trained and the model's main parameters are unchanged, the hardware and training-data requirements are smaller and fine-tuning is faster. A prompt-tuning example (a code sketch follows the example):
Input sequence: [Prompt1][Prompt2] "This movie is inspiring."
Question: Rate the emotional tone of this film.
Answer: The model needs to predict sentiment (e.g. “positive”)
Note: there is no explicit external prompt; [Prompt1][Prompt2] serve as internal prompts that guide the model. The question here is implicit, namely to judge the sentiment expressed in the text.
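A minimal P-Tuning-style sketch in PyTorch: two learnable prompt embeddings (mirroring [Prompt1][Prompt2] above) are spliced in front of the input embeddings of a frozen model. The model name is a stand-in, and the prompt encoder (LSTM/MLP) used by the original P-Tuning is omitted for brevity:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # freeze all original model parameters

n_prompts = 2  # [Prompt1][Prompt2]
prompt_emb = nn.Parameter(torch.randn(n_prompts, model.config.n_embd) * 0.02)

def forward_with_prompts(input_ids):
    tokens = model.get_input_embeddings()(input_ids)              # (B, T, D)
    prompts = prompt_emb.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompts, tokens], dim=1)           # (B, n+T, D)
    return model(inputs_embeds=inputs_embeds)

ids = tok("This movie is inspiring.", return_tensors="pt").input_ids
logits = forward_with_prompts(ids).logits  # gradients flow only to prompt_emb
```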
3. Knowledge Distillation
Knowledge distillation transfers the knowledge of a complex, large-scale teacher model to a lighter student model that performs the same function. This process generally involves techniques such as soft-label generation and temperature scaling.
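A common way to implement this is a loss that blends temperature-scaled soft labels from the teacher with the ordinary hard-label loss; the temperature and mixing weight below are typical but illustrative values:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft teacher targets (temperature scaling) with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),  # student's softened predictions
        F.softmax(teacher_logits / T, dim=-1),      # teacher's soft labels
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable to the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```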
4. Prompt Learning
Prompt-based tuning uses natural language prompts to guide the model to better understand the input and convert it into a form suitable for the downstream task. This approach is particularly suitable for few-shot or even zero-shot scenarios.
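A small illustration of the idea, using a cloze-style prompt with a masked language model; the prompt template and the label-word mapping (verbalizer) are hypothetical choices:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

text = "This movie is inspiring."
prompt = f"{text} Overall, it was a [MASK] movie."  # natural language prompt
verbalizer = {"great": "positive", "terrible": "negative"}

# Zero-shot sentiment: compare the model's scores for the two label words.
for pred in fill(prompt, targets=list(verbalizer)):
    print(verbalizer[pred["token_str"]], round(pred["score"], 4))
```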
5. Adapter Tuning
Similar in spirit to LoRA, adapter tuning aims to adapt the model to new tasks while keeping the original parameters of the pre-trained model unchanged. It works by inserting small neural network modules, called "adapters", between each layer (or selected layers) of the model; these adapters are trainable while the original model's parameters stay fixed (a minimal adapter sketch follows the two stages below).
1. Knowledge extraction stage: train the adapter modules to learn task-specific knowledge for the downstream task, encapsulating that knowledge in the adapter parameters.
2. Knowledge combination stage: fix the pre-trained model parameters and the task-specific adapter parameters, and introduce new parameters that learn to combine the knowledge in multiple adapters, improving the model's performance on the target task.
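A minimal bottleneck-adapter sketch in PyTorch; the hidden size and bottleneck dimension are illustrative:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # near-identity behavior at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden):
        return hidden + self.up(torch.relu(self.down(hidden)))

# Inserted after a (frozen) transformer sublayer; only the adapter trains.
adapter = Adapter()
hidden = torch.randn(4, 16, 768)  # (batch, seq_len, d_model)
out = adapter(hidden)
```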
Model fine-tuning strategy selection
Fine-tuning is a powerful tool for adapting large pre-trained models to specific tasks and application scenarios, and the proper selection and application of fine-tuning strategies is critical to achieving efficient, effective model performance.
1. Fine-tuning and transfer learning: fine-tuning is in fact an instance of transfer learning, in which a pre-trained model (usually trained on a large general-purpose dataset) is used as the starting point for a specific task. This approach enables efficient learning even for tasks with small datasets.
2. Choosing a fine-tuning strategy: the choice of fine-tuning method depends on several factors, including task complexity, the amount of available data, computing resources, and the desired performance. For example, for complex tasks that require fine-grained control, P-Tuning v2 or LSTM-based P-Tuning may be more suitable; where computing resources are limited, methods such as LoRA or Adapter Tuning are good choices.
3. Fine-tuning and generalization: a key concern when fine-tuning is preserving the model's generalization ability. Excessive fine-tuning may cause the model to overfit the specific training data at the expense of its generalization in practical applications.
4. Continuous development and innovation: as deep learning and NLP continue to evolve, new fine-tuning techniques and strategies keep emerging. Practitioners need to follow the latest research and technology trends, and flexibly select and adjust strategies according to project needs.