Large model quantization technology: mainstream method analysis and code practice

Written by
Jasper Cole
Updated on: July 1, 2025
Recommendation

Master large model quantization techniques, and efficiently deploying trillion-parameter models is no longer a problem.

Core content:
1. The role and classification of quantization: model compression, inference acceleration, and memory reduction
2. Detailed explanations of five mainstream quantization methods, including core techniques such as GPTQ, AWQ, and QLoRA
3. Formulas and code practice: the key steps to get started with quantization quickly

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

As the number of large model parameters exceeds one trillion, efficient deployment becomes a key challenge. Quantization technology significantly reduces model storage and computing overhead by converting high-precision floating-point numbers into low-bit integers. This article explains five mainstream large model quantization methods in detail, covering their functions, architectures, and innovations, and provides formulas and code examples to help you quickly master the core technologies.

1. The role and classification of quantization techniques

Core roles:

  1. Compress model size: for example, quantizing a 7B-parameter FP32 model (28 GB) to INT8 (7 GB) reduces its size by 75% (see the quick calculation after this list).
  2. Accelerate inference: low-precision integer operations are much faster than floating-point operations, which is especially useful for real-time inference on GPUs and CPUs.
  3. Reduce memory usage: quantizing activations and the KV cache improves throughput for long-sequence generation.
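
As a quick sanity check on these numbers, weight-only storage is simply parameter count times bits per weight. A minimal back-of-envelope sketch in Python (it ignores activations, the KV cache, and quantization metadata such as scales):

params = 7e9  # 7B parameters
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB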

Quantization classification:
•  Post-training quantization (PTQ): quantize the pre-trained model directly, with no fine-tuning (e.g., GPTQ, SmoothQuant).
•  Quantization-aware training (QAT): simulate quantization error during training to improve final accuracy (e.g., QLoRA).

2. Detailed explanation of mainstream quantization methods

1.  GPTQ (accurate post-training quantization for generative pre-trained Transformers)

Function: an efficient PTQ scheme for GPU inference, supporting 4-bit quantization with minimal accuracy loss.
Architecture and innovations:
•  Layer-by-layer optimization: quantizes the Transformer layer by layer so that errors do not accumulate across the network.
•  Hessian matrix approximation: uses second-order (Hessian) information about each layer's reconstruction error to decide how to round weights and to compensate the remaining, not-yet-quantized weights.
•  Formula:

$$\hat{W} = \arg\min_{\hat{W}} \lVert WX - \hat{W}X \rVert_2^2, \qquad H = 2XX^\top$$

where $H$ is the Hessian matrix of this layer-wise reconstruction objective, $W$ is the original weight, and $\hat{W}$ is the quantized weight.
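
To make the Hessian-weighted adjustment concrete, below is a simplified, illustrative sketch of one GPTQ/OBQ-style step: a single weight column is rounded to 4-bit, and its rounding error is redistributed onto the not-yet-quantized columns using the inverse Hessian. The function name, symmetric per-column scale, and shapes are assumptions for illustration, not AutoGPTQ's API.

import torch

def quantize_column_with_compensation(W, H_inv, col, n_bits=4):
    # W: (out_features, in_features) weight matrix; H_inv: inverse Hessian over input channels
    q_max = 2 ** (n_bits - 1) - 1
    scale = W[:, col].abs().max() / q_max
    q = torch.clamp(torch.round(W[:, col] / scale), -q_max - 1, q_max) * scale
    # Hessian-weighted rounding error, compensated onto the remaining columns
    err = (W[:, col] - q) / H_inv[col, col]
    W[:, col + 1:] -= err.unsqueeze(1) * H_inv[col, col + 1:].unsqueeze(0)
    W[:, col] = q
    return W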

Code example (using the AutoGPTQ library):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-7B-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-7B-GPTQ", use_safetensors=True, device="cuda:0")
inputs = tokenizer("Hello!", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))

2.  AWQ (Activation-Aware Weight Quantization)

Function: activation-aware weight quantization aimed at edge devices, balancing accuracy and computational efficiency.
Architecture and innovations:
•  Mixed-precision protection: the key, activation-salient weights are protected at higher effective precision while secondary weights are quantized to 4-bit, reducing information loss.
•  Hardware-friendly design: suited to CPUs and low-power GPUs, with roughly 2-3x faster inference.

Code example (load an AWQ model):

from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-7B-AWQ")
model = AutoAWQForCausalLM.from_quantized("TheBloke/Llama-7B-AWQ")
inputs = tokenizer("What is AI?", return_tensors="pt").to("cuda")
output = tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True)
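
If you need to produce such a 4-bit checkpoint yourself rather than download one, AutoAWQ also provides a quantization workflow. A minimal sketch, in which the source checkpoint, output directory, and calibration defaults are assumptions:

from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

model_path = "meta-llama/Llama-2-7b-hf"   # example FP16 source checkpoint
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)   # runs calibration internally
model.save_quantized("llama-2-7b-awq")
tokenizer.save_pretrained("llama-2-7b-awq")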

3.  QLoRA (Quantized Low-Rank Adaptation)

Function: a QAT-style scheme that fine-tunes adapters on top of a 4-bit quantized base model, suited to low-resource scenarios.
Architecture and innovations:
•  Double quantization: the quantization constants (scales) are themselves quantized a second time, further reducing the memory overhead of 4-bit storage.
•  NF4 data type: a 4-bit format whose quantization levels follow a normal distribution, matching large model weight distributions better than INT4/FP4.

Quantization formula (asymmetric quantization):

$$X_{\mathrm{int}} = \mathrm{clamp}\!\left(\Big\lfloor \frac{X}{s} \Big\rceil + z,\; 0,\; 2^{b}-1\right), \qquad s = \frac{X_{\max}-X_{\min}}{2^{b}-1}, \quad z = -\Big\lfloor \frac{X_{\min}}{s} \Big\rceil$$

where $X$ is the original tensor, $s$ is the scale, $z$ is the zero point, and $b$ is the bit width.
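
A minimal sketch of the QLoRA setup on the Hugging Face stack (transformers + bitsandbytes + peft); the checkpoint name and LoRA hyperparameters are illustrative assumptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights with double quantization, computed in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantization_config=bnb_config)

# Attach low-rank adapters; only these small adapter weights are trained
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()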

4.  SmoothQuant

Function: addresses outlier-heavy activation distributions and enables joint quantization of weights and activations.
Architecture and innovations:
•  Dynamic scaling factor: computes per-channel scaling factors for weights and activations from calibration data, shifting quantization difficulty from activations to weights to balance the error.
•  Formula (scaling factor calculation):

$$s_j = \frac{\max\bigl(|X_j|\bigr)^{\alpha}}{\max\bigl(|W_j|\bigr)^{1-\alpha}}$$

where $W$ is the weight, $X$ is the activation, $j$ indexes input channels, and $\alpha$ (typically 0.5) controls how much quantization difficulty is shifted from activations to weights.
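
A minimal PyTorch sketch of this smoothing step under the definitions above (the function name and calibration handling are assumptions; real implementations typically fold 1/s into the preceding LayerNorm instead of dividing activations at runtime):

import torch

def smooth_linear(act_absmax, linear, alpha=0.5, eps=1e-5):
    # act_absmax: per-input-channel max |X| collected on calibration data
    w_absmax = linear.weight.abs().amax(dim=0).clamp(min=eps)   # per input channel
    s = act_absmax.clamp(min=eps).pow(alpha) / w_absmax.pow(1 - alpha)
    linear.weight.data.mul_(s)                                  # W' = W * diag(s)
    return s                                                    # activations are divided by s

lin = torch.nn.Linear(8, 4)
x = torch.randn(16, 8)
y_ref = lin(x)                                   # output before smoothing
s = smooth_linear(x.abs().amax(dim=0), lin)
print(torch.allclose(y_ref, lin(x / s), atol=1e-5))   # same output, easier-to-quantize activations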

5.  BitsandBytes (dynamic quantization library)

Function: a lightweight tool in the Hugging Face ecosystem that supports 8-bit and 4-bit dynamic quantization.
Architecture and innovations:
•  Dynamic dequantization: quantized weights are restored to FP16 on the fly during inference, compatible with virtually all Transformer models.
•  Low GPU memory usage: a 13B model can run on a T4 GPU, roughly halving memory requirements.

Code example (4-bit quantization):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantization_config=quant_config)
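
A short usage sketch for the model loaded above (the prompt and generation length are arbitrary choices):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))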

3. Technical comparison and selection suggestions

| Method | Applicable scenarios | Precision loss | Hardware support |
| --- | --- | --- | --- |
| GPTQ | High-performance GPU inference | <1% | NVIDIA GPUs |
| AWQ | Edge devices / CPU | 1-2% | General-purpose processors |
| QLoRA | Low-resource fine-tuning | Negligible | Low-memory GPUs |
| BitsandBytes | Rapid prototyping | 2-3% | All devices |

Conclusion (80% of articles with conclusions are written by AI, but this one isn't)

Quantization is driving the adoption of large models across industries. The choice between PTQ and QAT should be made according to hardware constraints and accuracy requirements. Looking ahead, hybrid quantization (e.g., FP8 + INT4) and sparse quantization may become new directions.