Differences between LoRA and QLoRA

Explore how the LoRA and QLoRA fine-tuning techniques optimize resource consumption and improve efficiency when adapting AI models.
Core content:
1. Advantages of parameter-efficient fine-tuning (PEFT) over traditional fine-tuning
2. Differences in the technical principles and operation of LoRA and QLoRA
3. How LoRA saves resources and improves efficiency
LoRA vs. QLoRA: What’s the Difference?
Both LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) are techniques for training AI models. More specifically, both are forms of Parameter-Efficient Fine-Tuning (PEFT), an approach that is gaining popularity because it uses fewer resources than other methods of training large language models (LLMs).
Both LoRA and QLoRA make fine-tuning LLMs more efficient, but they differ in how they manipulate the model and use storage to achieve their results.
How do LoRA and QLoRA differ from traditional fine-tuning?
LLMs are complex models built from a very large number of parameters; some contain billions. Those parameters are what let the model learn from its training data. The more parameters a model has, the more storage it takes up, and, in general, the more capable it is.
Traditional fine-tuning refits (updates or adjusts) every individual parameter in order to adapt the LLM. That can mean adjusting billions of parameters, which takes a great deal of computing time and money.
Updating every parameter also risks "overfitting," the term for when an AI model learns "noise," or useless detail, on top of its regular training data.
Imagine a teacher lecturing in class. The class has studied math all year, and just before the exam the teacher stresses the importance of long division. During the exam, many students focus so heavily on long division that they forget other key equations that matter just as much for some problems. That is what overfitting does to an LLM during traditional fine-tuning.
In addition to the overfitting problem, traditional fine-tuning also incurs huge resource costs.
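To make that cost concrete, here is a rough back-of-the-envelope sketch in Python. The 7-billion-parameter model size and the use of the Adam optimizer are illustrative assumptions, not figures from this article:

```python
# Rough memory estimate for fully fine-tuning a 7B-parameter model
# with 16-bit weights and the Adam optimizer (illustrative numbers).
params = 7e9              # assumed model size: 7 billion parameters
weights_b = params * 2    # fp16/bf16 weights: 2 bytes per parameter
grads_b = params * 2      # gradients at the same precision
optim_b = params * 8      # Adam keeps two 32-bit states per parameter

total_gb = (weights_b + grads_b + optim_b) / 1e9
print(f"~{total_gb:.0f} GB for weights, gradients, and optimizer state")
# ~84 GB, before activations, and far beyond a single consumer GPU
```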
QLoRA and LoRA are both fine-tuning techniques that are far more efficient than full fine-tuning. Instead of training all of a model's parameters, they decompose the weight updates into matrices and train only the parameters needed to learn the new information.
In terms of our metaphor, these techniques introduce a new topic efficiently without distracting the model from the other topics on the test.
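In matrix terms, this is the standard low-rank formulation that LoRA (covered next) uses. For a pre-trained weight matrix W of size d × k, the update is constrained to the product of two much smaller matrices:

W' = W + ΔW = W + B·A, where B is d × r, A is r × k, and the rank r is far smaller than d and k.

Instead of learning all d × k entries of ΔW, only the r × (d + k) entries of B and A need to be trained.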
How does LoRA work?
LoRA trains an AI model on new data through a small set of new parameters.
Instead of training the entire model, the pre-trained weights are set aside, or "frozen," and a much smaller set of new parameters is trained instead. These new parameters live in "low-rank" adaptation matrices, which is where LoRA gets its name.
They are called low-rank matrices because they contain far fewer parameters and weights. After training, they are combined with the original parameters and used as a single matrix, which makes fine-tuning far more efficient.
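As a minimal sketch of this idea in PyTorch (the class name, rank, and scaling factor are illustrative choices, not part of any particular library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)          # freeze the pre-trained weights
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r, starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the scaled low-rank update (B @ A)
        return self.linear(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because B starts at zero, the adapted model initially behaves exactly like the original; training then only moves the small A and B matrices.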
It is easiest to understand the LoRA matrices by thinking of them as a row or column added to an existing matrix.
Picture a large matrix that holds every parameter that would need to be trained.
Training all of those weights takes a great deal of time, money, and memory, and once training is complete you may need to train again, consuming still more resources.
Now picture a single narrow column alongside that matrix: these are the low-rank weights.
Once the new low-rank parameters are trained, that single "row" or "column" is added to the original matrix, so the new training applies to all of the parameters.
Now the AI model can be run with the newly fine-tuned weights.
Training the low-rank weights takes less time, memory, and money, and once they are trained, what they have learned can be applied to the larger matrix without taking up any additional memory, as the merge step below shows.
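Continuing the LoRALinear sketch from above (where `layer` is an instance of that class), merging the trained update back into the frozen weights is a single in-place addition:

```python
# Fold the trained low-rank update into the original weight matrix.
# Afterwards the model runs as a plain linear layer, with no extra
# memory or latency at inference time.
with torch.no_grad():
    layer.linear.weight += (layer.B @ layer.A) * layer.scale
```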
Advantages of LoRA
When using LoRA technology, models can be fine-tuned with less time, resources, and effort. Benefits include:
- Fewer parameters to train
- Lower risk of overfitting
- Shorter training time
- Lower memory use
- Flexibility: you can adapt only certain parts of the model and leave the rest untouched (see the sketch below)
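In practice, libraries such as Hugging Face's peft expose this flexibility directly: you pick which modules receive adapters, and everything else stays frozen. A sketch (the model name and target modules are illustrative and depend on the architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load a base model (name is illustrative)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Adapt only the attention projections; everything else stays frozen
config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers get LoRA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% trainable
```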
How does QLoRA work?
QLoRA is an extension of LoRA. It works in a similar way but offers one additional advantage: it requires even less memory.
The "Q" in "QLoRA" stands for "quantized." Quantizing a model means compressing its complex, highly precise parameters (many precise numbers, lots of memory) into smaller, coarser parameters (fewer bits per number, much less memory).
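A toy example makes this concrete. The sketch below rounds values onto a uniform 4-bit grid and back; real NF4 (discussed next) uses a non-uniform grid shaped to how neural-network weights are actually distributed:

```python
import torch

def quantize_4bit(w: torch.Tensor):
    """Round each value to one of 16 evenly spaced levels (toy example)."""
    scale = w.abs().max() / 7                       # map values into [-7, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)  # 4-bit integer codes
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.tensor([0.82, -0.41, 0.05, 0.003])
q, scale = quantize_4bit(w)
print(dequantize(q, scale))  # close to w, but the tiniest value becomes 0.0
```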
Its goal is to fine-tune part of a model within the storage and memory of a single graphics processing unit (GPU). It does this using 4-bit NormalFloat (NF4), a data type that quantizes the weight matrices and requires even less memory than LoRA. By compressing parameters into this smaller, more manageable form, it can shrink the memory footprint to roughly one quarter of the original size.
Once the model is quantized, it is much smaller, which makes it far easier to fine-tune, as the setup sketch below illustrates.
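A common way to set this up combines the transformers, bitsandbytes, and peft libraries. A sketch (the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with its weights quantized to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in 16-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)

# Attach ordinary LoRA adapters on top of the quantized, frozen base
lora = LoraConfig(r=8, lora_alpha=16,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
```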
Think of the following row as the parameters of the original model:
[Figure: original model parameters (https://www.redhat.com/rhdc/managed-files/WholeParameterV4_Original-model-parameter%20copy%203.png)]
There are 12 parameters in total: 3 green, 6 blue, 2 yellow, and 1 pink. When the model is quantized, those parameters are compressed into a smaller representation:
[Figure: quantized model parameters (https://www.redhat.com/rhdc/managed-files/WholeParameterV4_Quantized-model%20copy%203.png)]
After quantization, the parameters that remain are 1 green, 2 blue, and 1 yellow.
Some data can be lost during this compression because the values are too small to represent, as in the toy example above. Here, the single pink parameter is lost: it is too small, relative to the rest of the set, to carry enough information into the compressed version.
In the example above, we compressed 12 parameters down to 4. In reality, billions of parameters are compressed into a form that can be fine-tuned, in a controllable way, on a single GPU.
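The arithmetic behind that claim is simple (again assuming an illustrative 7-billion-parameter model):

```python
params = 7e9                 # assumed 7B-parameter model
fp16_gb = params * 2 / 1e9   # 16 bits = 2 bytes per parameter -> 14.0 GB
nf4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter -> 3.5 GB
print(f"{fp16_gb:.1f} GB -> {nf4_gb:.1f} GB, a 4x reduction")
```

At about 3.5 GB for the frozen base weights, the model plus its small LoRA adapters fits comfortably on a single GPU.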
Ideally, once the newly trained matrix is added back to the original, any data lost to quantization would be recovered from the original parameters with no loss of precision or accuracy, but this is not guaranteed.
The technique pairs high-performance computing with a memory footprint that is easy to manage, so the model can stay highly accurate even on limited resources.
Advantages of QLoRA
QLoRA keeps memory requirements easy to manage. Like LoRA, it prioritizes efficiency, making the fine-tuning process faster and easier. Its advantages include:
- Requires even less memory than LoRA
- Helps avoid overfitting the data
- Maintains high accuracy
- Enables fast, lightweight model fine-tuning
What is the difference between LoRA and QLoRA?
LoRA is an efficient fine-tuning technique in its own right. QLoRA extends it with an additional layer of quantization to push efficiency further, and as a result it needs significantly less storage.
If you are unsure which technique suits your needs, consider how much storage and compute you have available. When memory is tight, QLoRA is usually the easier choice.