A Detailed Explanation of Large-Model Quantization Methods, with Recommendations

Written by Clara Bennett
Updated on: July 1, 2025
Recommendation

Master large-model quantization techniques to improve model performance and efficiency.

Core content:
1. The basic concepts and advantages of model quantization
2. Detailed analysis of the technical details and applicable scenarios of different quantization methods
3. Performance comparison analysis and quantization method selection suggestions


The following is a detailed technical analysis of model quantization methods (such as q4_0, q5_K_M, and q8_0), drawing on recent industry practice and research:

1. Overview of Quantization Methods

Model quantization reduces model size, speeds up inference, and lowers power consumption by lowering the numerical precision of weights and activation values (e.g., FP32 → INT8). Different quantization methods differ significantly in accuracy, computational efficiency, and hardware support.
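
To make the FP32 → INT8 idea concrete, below is a minimal NumPy sketch (illustrative only, not any framework's actual kernel) that quantizes a weight tensor symmetrically to 8-bit integers and dequantizes it back; the round trip shows where precision is lost and how much memory is saved.

```python
import numpy as np

def quantize_int8_symmetric(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map the integer codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

# Round-trip a random FP32 weight matrix and look at the error introduced.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8_symmetric(w)
w_hat = dequantize_int8(q, scale)
print("max abs error  :", float(np.abs(w - w_hat).max()))
print("bytes FP32/INT8:", w.nbytes, "/", q.nbytes)  # 4x smaller, before counting the scale
```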

2. Detailed Explanation of Common Quantization Methods

1. q4_0 (4-bit quantization)

  • Technical details:

    • Weights are quantized to 4-bit integers with a block (group) size of 32; activations are typically kept at higher precision or quantized on the fly. A simplified sketch follows this subsection.

    • Symmetric quantization is used (no zero-point); the per-block scale is stored as FP16.

  • Advantages:

    • Model size is greatly reduced (q4_0 is roughly 1/7 to 1/8 of FP32 once per-block scales are included).

    • Suitable for memory-constrained scenarios (e.g., mobile and embedded devices).

  • Disadvantages:

    • Accuracy loss is relatively large, and performance on complex tasks (e.g., nuanced natural language understanding) degrades noticeably.

    • Some hardware lacks native 4-bit compute, so values must be unpacked to higher precision (e.g., INT8) before the arithmetic is performed.
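
The sketch below illustrates the q4_0-style scheme described above, assuming blocks of 32 weights, one FP16 scale per block, and symmetric 4-bit codes. It is a simplification: the real llama.cpp kernels also pack two 4-bit codes per byte and use a slightly different scale convention.

```python
import numpy as np

BLOCK = 32  # q4_0 groups weights into blocks of 32

def quantize_q4_blocks(w):
    """Simplified q4_0-style quantization: per-block FP16 scale, symmetric 4-bit codes."""
    blocks = w.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)                    # guard all-zero blocks
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)  # 4-bit signed range
    return q, scale.astype(np.float16)                            # scale stored at FP16

def dequantize_q4_blocks(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

w = np.random.randn(4096 * BLOCK).astype(np.float32)
q, scale = quantize_q4_blocks(w)
w_hat = dequantize_q4_blocks(q, scale)
print("RMS quantization error:", float(np.sqrt(np.mean((w - w_hat) ** 2))))
# Storage per block: 32 codes * 4 bits + one 16-bit scale = 144 bits ≈ 4.5 bits/weight.
```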

2. q5_K_M (5-bit hybrid quantization)

  • Technical details:

    • Weight tensors use a mixed scheme: most are quantized with 5-bit K-quant blocks, while a small number of quantization-sensitive tensors (parts of the attention and feed-forward projections) are kept at higher 6-bit precision; the "M" denotes this medium mix. An illustrative sketch follows this subsection.

    • Asymmetric quantization (a scale plus a minimum per block) is used; the super-block quantization parameters are stored as FP16.

  • Advantages:

    • Higher accuracy than pure 4-bit quantization (e.g., Llama3-8B perplexity at q5_K_M is about 15% lower than at q4_0).

    • Computational efficiency is close to q4_0, making it suitable for mid-range hardware (e.g., consumer-grade GPUs).

  • Disadvantages:

    • Model size is slightly larger than with q4_0 (q5_K_M is about 1/6 of FP32).

    • The implementation is more complex and requires dedicated K-quant quantization and dequantization kernels.
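
The following toy sketch only illustrates the mixed-precision idea behind q5_K_M: most tensors get the lower bit width, a few sensitive tensors get more bits. It is not llama.cpp's actual tensor-selection logic, and the tensor-name suffixes are assumptions chosen for the example.

```python
# Hypothetical per-tensor bit-width policy illustrating the q5_K_M mix.
SENSITIVE_SUFFIXES = ("attn_v.weight", "ffn_down.weight")  # assumed names, for illustration

def pick_bits(tensor_name: str) -> int:
    """Choose a bit width for one weight tensor."""
    if tensor_name.endswith(SENSITIVE_SUFFIXES):
        return 6   # keep quantization-sensitive tensors at higher precision
    return 5       # default lower-bit quantization elsewhere

for name in ("blk.0.attn_q.weight", "blk.0.attn_v.weight", "blk.0.ffn_down.weight"):
    print(f"{name} -> {pick_bits(name)}-bit")
```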

3. q8_0 (8-bit quantization)

  • Technical details:

    • Weights (and, at inference time, activations) are quantized to 8-bit integers with a block (group) size of 32.

    • Symmetric quantization is used; the per-block scale is stored as FP16.

  • Advantages:

    • Accuracy loss is minimal (e.g., Llama3-8B at q8_0 has perplexity close to FP32).

    • Broad hardware support for 8-bit integer arithmetic (e.g., NVIDIA Tensor Cores, Intel VNNI); see the dot-product sketch at the end of this subsection.

  • Disadvantages:

    • Model size is comparatively large (q8_0 is about 1/4 of FP32).

    • Inference is slower than with lower-bit formats (q4_0/q5_K_M), largely because more memory bandwidth is needed per token.
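
One reason 8-bit formats enjoy broad hardware support is that the bulk of a matrix multiplication can stay in integer arithmetic, with the floating-point scales applied once at the end. Below is a minimal NumPy sketch of that idea (illustrative only, not an actual Tensor Core or VNNI kernel).

```python
import numpy as np

def quantize_q8(v):
    """Symmetric 8-bit quantization of a vector: v ≈ scale * q."""
    scale = float(np.abs(v).max()) / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

a = np.random.randn(4096).astype(np.float32)
b = np.random.randn(4096).astype(np.float32)
qa, sa = quantize_q8(a)
qb, sb = quantize_q8(b)

# The dot product runs in integer arithmetic (what INT8 tensor cores and
# VNNI-style instructions accelerate); one floating-point multiply by the
# combined scale is applied at the end.
int_dot = np.dot(qa.astype(np.int32), qb.astype(np.int32))
print("FP32 dot   :", float(np.dot(a, b)))
print("INT8 approx:", sa * sb * float(int_dot))
```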


3. Performance Comparison (Llama3-8B Example)

| Quantization method | Model size | Inference speed (tokens/s) | Perplexity (PPL) | Applicable scenarios |
| --- | --- | --- | --- | --- |
| FP32 | 13.5 GB | 25~30 | 3.12 | High-performance computing |
| q8_0 | 3.5 GB | 50~60 | 3.15 | General hardware |
| q5_K_M | 2.1 GB | 75~85 | 3.28 | Mid-range hardware |
| q4_0 | 1.7 GB | 90~100 | 3.75 | Memory-constrained devices |
| No quantization | 4.7 GB | 35~40 | 3.10 | Uncompressed original-precision model |

Note: The test environment is NVIDIA RTX 4090, batch size=1.
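
As a rough cross-check on the size column, each format's compression ratio follows from its approximate bits per weight. The bits-per-weight values below are assumptions that include per-block scale overhead; exact GGUF file sizes also depend on embeddings, output layers, and metadata.

```python
# Approximate bits per weight, including per-block scale overhead.
# These are rough estimates, not exact GGUF accounting.
BITS_PER_WEIGHT = {
    "FP32":   32.0,
    "q8_0":    8.5,   # 8-bit codes + one FP16 scale per 32-weight block
    "q5_K_M":  5.7,   # average over the mixed 5-/6-bit K-quant tensors
    "q4_0":    4.5,   # 4-bit codes + one FP16 scale per 32-weight block
}

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:8s} {bpw:4.1f} bits/weight  ~1/{32.0 / bpw:.1f} of FP32")
```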

4. Suggestions for Choosing a Quantization Method

  • Precision first: choose q8_0 for scenarios with strict task-quality requirements (e.g., financial analysis, legal document processing).

  • Balanced accuracy and efficiency: choose q5_K_M, suitable for mid-range hardware (e.g., RTX 3060, Intel Arc).

  • Maximum compression: choose q4_0 for memory-constrained devices (e.g., embedded systems, mobile phones).

  • Hardware compatibility: confirm that the target hardware supports low-bit compute (e.g., NVIDIA's Ampere architecture supports INT4 on Tensor Cores). A minimal loading example follows this list.
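
Once a format is chosen, running the corresponding GGUF file might look like the following minimal sketch, assuming the llama-cpp-python bindings are installed; the model path is hypothetical and the parameters are values you would tune for your hardware.

```python
# Minimal sketch: loading and prompting a quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama3-8b-instruct.q5_k_m.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if memory allows
)

out = llm("Explain model quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```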


5. Future Trends

  • Adaptive quantization: Dynamically adjust quantization parameters based on input data (such as Microsoft's Adaptive Quantization).

  • Extremely low-bit quantization: exploring 2-bit quantization, with knowledge distillation used to recover accuracy.

  • Hardware-algorithm co-design: for example, Huawei's block-quantization patent, which optimizes how compute units are matched to quantization strategies.