Re-parameterized fine-tuning: how the LoRA family cuts the cost of training large models

Is fine-tuning a large model too expensive? The LoRA family of techniques can help solve this problem and make model training far more efficient.
Core content:
1. Why training a large model is so expensive, and how LoRA-style techniques cut the cost
2. The principles and advantages of LoRA, and how it makes models plug-and-play with flexible task switching
3. What AdaLoRA upgrades, and how it lets the model allocate its learning resources more intelligently
1. Ordinary people simply can’t afford large models?
Today's AI models are becoming more and more like "top students": GPT-4 can write articles, do math problems, and even understand pictures, but training such a "top student" is frighteningly expensive. Just training a large model with 65 billion parameters takes hundreds of top-end graphics cards running simultaneously for days. Ordinary developers don't even have enough graphics card memory (GPU memory, often called "video memory", think of it as the GPU's "temporary workspace"), let alone the very real electricity bill.
If traditional full-parameter fine-tuning is like tearing down the entire building and rebuilding it, re-parameterized fine-tuning only "redecorates" the key rooms, spending a little money to get a big job done. Today, let's talk about the technology family that has turned large-model fine-tuning from "exclusive to the rich" into "a game ordinary people can play": LoRA and its upgraded versions.
2. LoRA: "Decorating" the model with building-block thinking
1. What is LoRA?
The full name of LoRA is Low-Rank Adaptation (of Large Language Models), i.e. "low-rank adaptation fine-tuning". It sounds complicated, but the principle is like building blocks. Inside a large model there are huge "data processing tables" (weight matrices). Modifying such a table directly means changing millions or even billions of cells, which is far too expensive. LoRA's approach is to express the change to the big table as the product of two small tables and train only the cells in those two small tables (a low-rank decomposition).
For example, if you originally needed to modify a 1000×1000 table (1,000,000 cells), you can instead train a 1000×10 table and a 10×1000 table: only 20,000 cells, about 98% fewer trainable parameters! It's like tuning a car: you don't need to replace the entire engine, swapping a few core parts is enough to make a big difference.
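To make the building-block analogy concrete, here is a minimal sketch of the idea in PyTorch. The class name LoRALinear, the rank r=10, and the scaling factor alpha are illustrative choices for this article, not the official implementation:

```python
# Minimal LoRA sketch: freeze the big table W, train only two small tables A and B.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=10, alpha=16):
        super().__init__()
        # The original "big table" stays frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # The two trainable "small tables": A (r x in) and B (out x r).
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x):
        # Output = frozen path + low-rank correction; only A and B receive gradients.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# The 1000x1000 example from the text:
full = 1000 * 1000                # 1,000,000 cells to train
lora = 1000 * 10 + 10 * 1000      # 20,000 cells to train
print(f"{100 * (1 - lora / full):.0f}% fewer trainable parameters")  # 98%
```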
2. What problem does it solve?
Save money and time: the Alpaca-LoRA project used LoRA to fine-tune a 7-billion-parameter model at roughly 1/1000 of the cost of traditional full fine-tuning, and it can run on an ordinary computer.
Plug and play: after training, the two small tables can be merged straight back into the original weights, like clicking building blocks together into a finished shape. At inference you cannot tell the split ever existed, and speed is not affected at all.
Flexible task switching: if the model learns to write love letters today and code tomorrow, you only need to load a different pair of "small tables" rather than retraining the whole model, as convenient as changing your phone's theme (see the sketch after this list).
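A hedged follow-up sketch of the "plug and play" and "task switching" points, reusing the LoRALinear class from the earlier snippet; this is plain PyTorch for illustration, not the API of any particular library:

```python
# Merging and swapping LoRA adapters (continuing the LoRALinear sketch above).
import torch

layer = LoRALinear(1000, 1000, r=10)

# Plug and play: fold the two small tables back into the big one after training,
# so inference runs on a single matrix with no extra latency.
with torch.no_grad():
    merged_weight = layer.weight + (layer.lora_B @ layer.lora_A) * layer.scaling

# Flexible task switching: keep one (A, B) pair per task and load the one you need.
adapters = {
    "love_letters": (layer.lora_A.detach().clone(), layer.lora_B.detach().clone()),
    "code": (torch.randn(10, 1000) * 0.01, torch.zeros(1000, 10)),
}
layer.lora_A.data, layer.lora_B.data = adapters["code"]   # switch task without retraining
```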
LoRA is like giving the big model's "luxury building" a partial renovation: nothing is demolished and rebuilt, only a few core functional areas are remodeled, delivering a personalized upgrade at low cost.
3. AdaLoRA: Let the model decide what to learn
1. Where has AdaLoRA been upgraded?
LoRA works well, but it has one small problem: the size of the "small tables" (the rank, r) has to be set by hand and is the same everywhere, like a teacher assigning identical homework to every student in the class regardless of whether a student is good at math or at Chinese. AdaLoRA acts like a smart teacher and dynamically adjusts the "amount of homework" according to how important each part of the model (each "student", i.e. each weight matrix) turns out to be.
Its principle is to parameterize the update in an SVD-like form, splitting it into "components" of different importance: components that contribute more to the model (those with large singular values) get a bigger share of the learning budget, while unimportant ones get less. It's like learning English by drilling high-frequency words first and spending little time on rare words, so the same effort goes further.
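The idea can be sketched in a few lines: the update is stored as an SVD-style triplet, and the least important "singular values" are zeroed so their budget can go elsewhere. This is a simplified illustration with made-up sizes and a magnitude-based importance score, not the official AdaLoRA implementation (which uses a sensitivity-based score):

```python
# Simplified AdaLoRA-style idea: an SVD-like update P diag(lam) Q whose least
# important components are pruned during training.
import torch
import torch.nn as nn

in_f, out_f, r_init = 1000, 1000, 12
P = nn.Parameter(torch.randn(out_f, r_init) * 0.01)   # left directions
lam = nn.Parameter(torch.randn(r_init) * 0.01)        # trainable "singular values"
Q = nn.Parameter(torch.randn(r_init, in_f) * 0.01)    # right directions

def delta_weight():
    # Low-rank update added to the frozen weight: P diag(lam) Q.
    return P @ torch.diag(lam) @ Q

def prune_to(keep: int):
    # Importance proxy here: magnitude of each "singular value".
    importance = lam.detach().abs()
    threshold = importance.topk(keep).values.min()
    with torch.no_grad():
        lam[importance < threshold] = 0.0   # drop the least useful components

prune_to(keep=4)   # shrink the effective rank from 12 down to 4
print(int((lam != 0).sum()), "components kept")
```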
2. What benefits have been brought?
Learn smarter: on the GLUE benchmark (the model's "all-subject exam"), AdaLoRA scores higher than LoRA at the same parameter budget, like a student who has learned to grasp the key points and so reviews more efficiently.
Adapt to multiple tasks: when handling multimodal image + text tasks, the "learning focus" is re-allocated on the fly as training proceeds, far more flexible than a hand-picked fixed setting.
Reduce overfitting: useless "components" are pruned away automatically, like tidying your school bag and carrying only the textbooks you actually need.
AdaLoRA is like equipping the model with an "intelligent learning manager" that automatically determines where to spend more energy and avoids wasted effort.
4. QLoRA: "Compress and package" model data so that it can run on a small computer
1. QLoRA's core trick: compressing GPU memory usage
Although LoRA saves on trainable parameters, fine-tuning still eats a lot of GPU memory: fully fine-tuning a 65-billion-parameter model in 16-bit takes over 780GB, roughly the combined memory of 100 ordinary computers. QLoRA attacks this pain point by "compressing and packaging" the data: weights originally stored in 16 bits are compressed to 4 bits (4-bit quantization), like shrinking an ultra-high-definition movie down to high definition, much smaller with barely any visible loss.
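These memory figures can be sanity-checked with back-of-the-envelope arithmetic; the byte counts below are rough rules of thumb (16-bit weights, gradients and Adam optimizer states), not exact measurements:

```python
# Rough memory estimates for a 65-billion-parameter model (illustrative only).
params = 65e9

weights_fp16 = params * 2 / 1e9              # 16-bit weights alone: ~130 GB
full_finetune = params * (2 + 2 + 8) / 1e9   # + gradients + Adam states: ~780 GB
weights_nf4 = params * 0.5 / 1e9             # 4-bit quantized weights: ~32.5 GB

print(f"fp16 weights ~{weights_fp16:.0f} GB, full fine-tune ~{full_finetune:.0f} GB, "
      f"4-bit weights ~{weights_nf4:.1f} GB")
```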
2. How does it do it?
A tailor-made compression format: NF4 (4-bit NormalFloat) was designed specifically for model weights, which mostly follow a normal distribution; after quantization almost no performance is lost, and the QLoRA paper's Guanaco model even reaches 99.3% of ChatGPT's level on its chatbot benchmark.
Double quantization: not only are the weights themselves compressed to a smaller data type, the quantization constants ("scaling factors") produced in the first pass are quantized again, squeezing out even more space.
A master of memory management: "paged optimizers" shuttle optimizer states between GPU and CPU memory and fetch them only when needed, like borrowing books from a library instead of owning them all. The result is that a single graphics card can fine-tune a 65-billion-parameter model, finally putting the "top student" models within reach of ordinary developers (a configuration sketch follows below).
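Here is a sketch of how these three pieces are typically wired together with the Hugging Face transformers, peft and bitsandbytes libraries; the model name is a placeholder, and exact argument names can vary between library versions:

```python
# QLoRA-style setup sketch (transformers + peft + bitsandbytes). Placeholders throughout.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # compress 16-bit weights to 4 bits
    bnb_4bit_quant_type="nf4",              # the NF4 format described above
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the arithmetic in 16 bits for stability
)

# Load the frozen base model in 4-bit form.
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",                      # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only small LoRA adapters on top of the frozen 4-bit weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# The "paged optimizer" is selected via the Trainer's optimizer name.
args = TrainingArguments(output_dir="qlora-out", optim="paged_adamw_32bit")
```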
3. How well does it work in practice?
GPU memory drops dramatically: from over 780GB down to 48GB, so the whole job fits on a single high-end graphics card.
Speed holds up: a 65-billion-parameter model can be fine-tuned in about 24 hours on a single GPU, something unimaginable before.
One-click operation: platforms such as Hugging Face have integrated QLoRA, so you can get started with a few clicks, about as easy as editing a photo in a phone app.
QLoRA essentially puts the model's data on a diet, throwing away redundant information so that a modest computer can run a large model smoothly.
5. DyLoRA: the shape-shifting "Transformer robot" of large-model training
1. Dynamic adjustment: simple where it can be, complex where it must be
The techniques above are good, but the size of the "small tables" (the rank, r) is fixed before training, like students practicing only with exam papers of one fixed difficulty. DyLoRA makes this dynamic: a small table is enough for easy questions (simple data), and a larger table can be switched in for hard questions (complex data), just like a Transformer robot changing form to match the strength of its opponent.
2. How to achieve "dynamic transformation"?
Train with every "gear" at once: during training, a rank anywhere from r=1 to r=64 is sampled at each step, so every truncated version of the small tables gets trained, like an athlete practicing sprints and long-distance runs at the same time and switching between them at will.
Pick the right gear automatically: at inference, a "small table" of the appropriate size can be chosen to match how hard the input is, a simple one for telling cats from dogs and a larger one for recognizing rare animals, with no retraining and no human intervention along the way.
No tedious rank search: because ranks are sampled randomly during training, there is no need to train and compare one r value after another as in traditional methods; the authors report an overall speed-up of 4 to 7 times over LoRA once that search is taken into account (a minimal sketch follows this list).
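A minimal sketch of the core trick, sampling the rank during training so that every truncation of the small tables remains usable afterwards; plain PyTorch, heavily simplified from the paper:

```python
# DyLoRA-style rank sampling (simplified): train one pair of small tables so that
# any truncation from r=1 to r=64 works at inference time.
import random
import torch
import torch.nn as nn

in_f, out_f, r_max = 1000, 1000, 64
W = torch.randn(out_f, in_f)                          # frozen base weight
A = nn.Parameter(torch.randn(r_max, in_f) * 0.01)     # "small table" A
B = nn.Parameter(torch.zeros(out_f, r_max))           # "small table" B

def forward(x, rank):
    # Use only the first `rank` rows of A and columns of B (nested structure).
    return x @ W.T + x @ A[:rank].T @ B[:, :rank].T

optimizer = torch.optim.Adam([A, B], lr=1e-3)
for step in range(3):                                  # toy training loop
    b = random.randint(1, r_max)                       # draw a rank for this step
    x, target = torch.randn(8, in_f), torch.randn(8, out_f)
    loss = ((forward(x, b) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference: a small rank for easy inputs, a large one for hard inputs.
y_easy = forward(torch.randn(1, in_f), rank=4)
y_hard = forward(torch.randn(1, in_f), rank=64)
```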
3. What scenarios is it suitable for?
Real-time Q&A: an intelligent customer-service bot can answer simple questions in its lightweight mode and "transform" into a more powerful mode when a complex question comes in, staying both fast and accurate.
On-device AI: on devices with limited computing power such as phones, adjusting the rank to the task saves power while keeping the results acceptable.
DyLoRA lets the model grow bigger or smaller like Sun Wukong, switching forms automatically for different tasks, fast and effort-saving.
6. Technical Comparison
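Summarizing the four techniques covered above:

Technique | Core idea | Highlight
LoRA | Train two small low-rank tables instead of the big one | About 98% fewer trainable parameters; adapters merge back with no extra inference latency
AdaLoRA | Allocate rank by importance using SVD-style components | Better results than LoRA at the same parameter budget; prunes useless components
QLoRA | 4-bit NF4 quantization, double quantization and paged optimizers | A 65-billion-parameter model fine-tunes on a single 48GB GPU in about a day
DyLoRA | Sample the rank during training so every rank works afterwards | No rank search; reported 4 to 7 times faster overall than LoRA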
7. Future Trends
These technologies not only reduce costs, but also bring about an "AI democratization revolution":
Ordinary people can play with large models: training large models used to be the preserve of big companies; now an individual developer with a consumer graphics card costing a few thousand yuan can fine-tune a model with billions or even tens of billions of parameters.
Models that understand personalized needs: an AI that helps children with homework can adjust the difficulty to the child's level on the fly, and an AI serving the elderly can simplify technical jargon; both depend on this kind of dynamic fine-tuning.
Challenges remain: for example, explaining the behavior of a dynamically adjusted model, such as why r=10 was used this time and r=20 the next, will take further research before such models become "transparent".
8. Conclusion
From LoRA to DyLoRA, re-parameterized fine-tuning is like fitting an "energy-saving engine" to large models, bringing the once-unreachable "top student" AIs down to earth. When an ordinary computer can fine-tune a model with tens of billions of parameters, AI stops being a toy for the few and becomes a tool anyone can pick up. In the future, perhaps the exclusive conversation model you design for your pet will become a hit in your circle of friends; breakthroughs like these are what make that possible.