A Detailed Explanation of Large-Model Quantization Methods, with Recommendations

Written by Clara Bennett
Updated on: July 1, 2025
Recommendation

Master large-model quantization techniques to improve model performance and efficiency.

Core content:
1. The basic concepts and advantages of model quantization
2. Detailed analysis of the technical details and applicable scenarios of different quantization methods
3. Performance comparison analysis and quantization method selection suggestions


The following is a detailed technical analysis of model quantization methods (such as q4_0, q5_K_M, and q8_0), drawing on recent industry practice and research:

1. Overview of Quantization Methods

Model quantization reduces model size, speeds up inference, and lowers power consumption by lowering the numerical precision of weights and activation values (e.g., FP32 → INT8). Different quantization methods differ significantly in accuracy, computational efficiency, and hardware support.
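
To make the FP32 → INT8 idea concrete, below is a minimal NumPy sketch (illustrative only, not any framework's actual kernel) that quantizes a weight tensor symmetrically to 8-bit integers and dequantizes it back; the round trip shows where precision is lost and how much memory is saved.

```python
import numpy as np

def quantize_int8_symmetric(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map the integer codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

# Round-trip a random FP32 weight matrix and look at the error introduced.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8_symmetric(w)
w_hat = dequantize_int8(q, scale)
print("max abs error  :", float(np.abs(w - w_hat).max()))
print("bytes FP32/INT8:", w.nbytes, "/", q.nbytes)  # 4x smaller, before counting the scale
```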

2. Detailed Explanation of Common Quantization Methods

1. q4_0 (4-bit quantization)

  • Technical details:

    • Weights are quantized to 4-bit integers with a block (group) size of 32; activations are typically kept at higher precision or quantized on the fly. A simplified sketch follows this subsection.

    • Symmetric quantization is used (no zero-point); the per-block scale is stored as FP16.

  • Advantages:

    • Model size is greatly reduced (q4_0 is roughly 1/7 to 1/8 of FP32 once per-block scales are included).

    • Suitable for memory-constrained scenarios (e.g., mobile and embedded devices).

  • Disadvantages:

    • Accuracy loss is relatively large, and performance on complex tasks (e.g., nuanced natural language understanding) degrades noticeably.

    • Some hardware lacks native 4-bit compute, so values must be unpacked to higher precision (e.g., INT8) before the arithmetic is performed.
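
The sketch below illustrates the q4_0-style scheme described above, assuming blocks of 32 weights, one FP16 scale per block, and symmetric 4-bit codes. It is a simplification: the real llama.cpp kernels also pack two 4-bit codes per byte and use a slightly different scale convention.

```python
import numpy as np

BLOCK = 32  # q4_0 groups weights into blocks of 32

def quantize_q4_blocks(w):
    """Simplified q4_0-style quantization: per-block FP16 scale, symmetric 4-bit codes."""
    blocks = w.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)                    # guard all-zero blocks
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)  # 4-bit signed range
    return q, scale.astype(np.float16)                            # scale stored at FP16

def dequantize_q4_blocks(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

w = np.random.randn(4096 * BLOCK).astype(np.float32)
q, scale = quantize_q4_blocks(w)
w_hat = dequantize_q4_blocks(q, scale)
print("RMS quantization error:", float(np.sqrt(np.mean((w - w_hat) ** 2))))
# Storage per block: 32 codes * 4 bits + one 16-bit scale = 144 bits ≈ 4.5 bits/weight.
```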

2. q5_K_M (5-bit hybrid quantization)

  • Technical details:

    • Weight tensors use a mixed scheme: most are quantized with 5-bit K-quant blocks, while a small number of quantization-sensitive tensors (parts of the attention and feed-forward projections) are kept at higher 6-bit precision; the "M" denotes this medium mix. An illustrative sketch follows this subsection.

    • Asymmetric quantization (a scale plus a minimum per block) is used; the super-block quantization parameters are stored as FP16.

  • Advantages:

    • Higher accuracy than pure 4-bit quantization (e.g., Llama3-8B perplexity at q5_K_M is about 15% lower than at q4_0).

    • Computational efficiency is close to q4_0, making it suitable for mid-range hardware (e.g., consumer-grade GPUs).

  • Disadvantages:

    • Model size is slightly larger than with q4_0 (q5_K_M is about 1/6 of FP32).

    • The implementation is more complex and requires dedicated K-quant quantization and dequantization kernels.
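
The following toy sketch only illustrates the mixed-precision idea behind q5_K_M: most tensors get the lower bit width, a few sensitive tensors get more bits. It is not llama.cpp's actual tensor-selection logic, and the tensor-name suffixes are assumptions chosen for the example.

```python
# Hypothetical per-tensor bit-width policy illustrating the q5_K_M mix.
SENSITIVE_SUFFIXES = ("attn_v.weight", "ffn_down.weight")  # assumed names, for illustration

def pick_bits(tensor_name: str) -> int:
    """Choose a bit width for one weight tensor."""
    if tensor_name.endswith(SENSITIVE_SUFFIXES):
        return 6   # keep quantization-sensitive tensors at higher precision
    return 5       # default lower-bit quantization elsewhere

for name in ("blk.0.attn_q.weight", "blk.0.attn_v.weight", "blk.0.ffn_down.weight"):
    print(f"{name} -> {pick_bits(name)}-bit")
```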

3. q8_0 (8-bit quantization)

  • Technical details:

    • Weights (and, at inference time, activations) are quantized to 8-bit integers with a block (group) size of 32.

    • Symmetric quantization is used; the per-block scale is stored as FP16.

  • Advantages:

    • Accuracy loss is minimal (e.g., Llama3-8B at q8_0 has perplexity close to FP32).

    • Broad hardware support for 8-bit integer arithmetic (e.g., NVIDIA Tensor Cores, Intel VNNI); see the dot-product sketch at the end of this subsection.

  • Disadvantages:

    • Model size is comparatively large (q8_0 is about 1/4 of FP32).

    • Inference is slower than with lower-bit formats (q4_0/q5_K_M), largely because more memory bandwidth is needed per token.
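
One reason 8-bit formats enjoy broad hardware support is that the bulk of a matrix multiplication can stay in integer arithmetic, with the floating-point scales applied once at the end. Below is a minimal NumPy sketch of that idea (illustrative only, not an actual Tensor Core or VNNI kernel).

```python
import numpy as np

def quantize_q8(v):
    """Symmetric 8-bit quantization of a vector: v ≈ scale * q."""
    scale = float(np.abs(v).max()) / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

a = np.random.randn(4096).astype(np.float32)
b = np.random.randn(4096).astype(np.float32)
qa, sa = quantize_q8(a)
qb, sb = quantize_q8(b)

# The dot product runs in integer arithmetic (what INT8 tensor cores and
# VNNI-style instructions accelerate); one floating-point multiply by the
# combined scale is applied at the end.
int_dot = np.dot(qa.astype(np.int32), qb.astype(np.int32))
print("FP32 dot   :", float(np.dot(a, b)))
print("INT8 approx:", sa * sb * float(int_dot))
```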


3. Performance Comparison (Llama3-8B Example)

| Quantization method | Model size | Inference speed (tokens/s) | Perplexity (PPL) | Applicable scenarios |
| --- | --- | --- | --- | --- |
| FP32 | 13.5 GB | 25~30 | 3.12 | High-performance computing |
| q8_0 | 3.5 GB | 50~60 | 3.15 | General hardware |
| q5_K_M | 2.1 GB | 75~85 | 3.28 | Mid-range hardware |
| q4_0 | 1.7 GB | 90~100 | 3.75 | Memory-constrained devices |
| No quantization | 4.7 GB | 35~40 | 3.10 | Uncompressed original-precision model |

Note: The test environment is NVIDIA RTX 4090, batch size=1.
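
As a rough cross-check on the size column, each format's compression ratio follows from its approximate bits per weight. The bits-per-weight values below are assumptions that include per-block scale overhead; exact GGUF file sizes also depend on embeddings, output layers, and metadata.

```python
# Approximate bits per weight, including per-block scale overhead.
# These are rough estimates, not exact GGUF accounting.
BITS_PER_WEIGHT = {
    "FP32":   32.0,
    "q8_0":    8.5,   # 8-bit codes + one FP16 scale per 32-weight block
    "q5_K_M":  5.7,   # average over the mixed 5-/6-bit K-quant tensors
    "q4_0":    4.5,   # 4-bit codes + one FP16 scale per 32-weight block
}

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:8s} {bpw:4.1f} bits/weight  ~1/{32.0 / bpw:.1f} of FP32")
```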

4. Suggestions for Choosing a Quantization Method

  • Precision first: choose q8_0 for scenarios with strict task-quality requirements (e.g., financial analysis, legal document processing).

  • Balanced accuracy and efficiency: choose q5_K_M, suitable for mid-range hardware (e.g., RTX 3060, Intel Arc).

  • Maximum compression: choose q4_0 for memory-constrained devices (e.g., embedded systems, mobile phones).

  • Hardware compatibility: confirm that the target hardware supports low-bit compute (e.g., NVIDIA's Ampere architecture supports INT4 on Tensor Cores). A minimal loading example follows this list.
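
Once a format is chosen, running the corresponding GGUF file might look like the following minimal sketch, assuming the llama-cpp-python bindings are installed; the model path is hypothetical and the parameters are values you would tune for your hardware.

```python
# Minimal sketch: loading and prompting a quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama3-8b-instruct.q5_k_m.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if memory allows
)

out = llm("Explain model quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```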


5. Future Trends

  • Adaptive quantization: Dynamically adjust quantization parameters based on input data (such as Microsoft's Adaptive Quantization).

  • Extremely low-bit quantization: exploring 2-bit quantization, with knowledge distillation used to recover accuracy.

  • Hardware-algorithm co-design: for example, Huawei's block-quantization patent, which optimizes how compute units are matched to quantization strategies.