How many kinds of precision do large models involve? FP32, TF32, FP16, BF16, FP8, FP4, NF4, and INT8 explained in one article

An in-depth look at precision in large model training, from FP32 to INT8; this article gives you a comprehensive overview.
Core content:
1. Classification and definition of floating point precision and quantization precision
2. The impact of different precisions on computing cost and accuracy
3. Selection and usage scenarios of various precisions in practical applications
Training and inference of large models constantly involve the concept of precision, and there are many kinds of it. Moreover, at the same bit width there can be different formats. I haven't found an article online that covers all of them in one place, so here is a comprehensive summary.
Overall Introduction
Floating point precision: double precision (FP64), single precision (FP32, TF32), half precision (FP16, BF16), 8-bit precision (FP8), 4-bit precision (FP4, NF4)
Quantization precision: INT8, INT4 (and also INT3/INT5/INT6)
In addition, in actual usage scenarios, there are also the concepts of multi-precision and mixed precision.
What is precision?
Suppose you earn 1 yuan per second; your monthly income is 1 × 60 × 60 × 24 × 30 = 2,592,000 yuan. If you instead earn 1.1 yuan per second, your monthly income is 2,851,200 yuan. A difference of just 0.1 yuan per second changes your monthly income by more than 250,000 yuan. This is the difference that precision makes.
Another typical example is π, which is often expressed as 3.14, but if higher precision is required, there can be an infinite number of digits after the decimal point.
Of course, these are precision in the mathematical sense. In a computer, the precision of a floating-point number is determined by how it is stored: the more bits it occupies, the higher the precision.
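A minimal illustration (assuming NumPy is installed): the same value π stored at three different precisions, where fewer fraction bits mean fewer significant digits are preserved:
import numpy as np
# π stored with 52, 23, and 10 fraction bits respectively
print(np.float64(np.pi))  # ≈ 3.141592653589793
print(np.float32(np.pi))  # ≈ 3.1415927
print(np.float16(np.pi))  # ≈ 3.14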
Why so much precision?
Because of cost and accuracy.
We all know that high precision is definitely more accurate, but it also brings higher computing and storage costs. Lower precision will reduce computing accuracy, but can improve computing efficiency and performance. Therefore, a variety of different precisions allow you to choose the most suitable one in different situations.
Double precision is more accurate than single precision, but it occupies twice the storage and is slower to compute. If single precision is enough, there is no need for double precision.
Different floating point precisions
In computers, floating-point numbers are stored in three parts: sign, exponent, and fraction. The sign always occupies 1 bit, the exponent bits determine the representable range, and the fraction bits determine the precision.
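A small sketch (standard library only) that prints the three parts of an FP32 number, to make the layout concrete:
import struct

def fp32_bits(x: float) -> str:
    # Pack the value as a 32-bit big-endian float, then format its raw bits
    bits = format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')
    return f"sign={bits[0]} exponent={bits[1:9]} fraction={bits[9:]}"

print(fp32_bits(3.14))  # 1 sign bit, 8 exponent bits, 23 fraction bits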
[FP precision]
FP (Floating Point) refers to the standard floating-point formats defined by IEEE 754. Each consists of three parts: a sign, an exponent, and a fraction.
FP64 is a 64-bit floating point number consisting of 1 sign bit, 11 exponent bits, and 52 fraction bits.
FP32, FP16, FP8, and FP4 are composed in the same way, differing only in the number of exponent bits and fraction bits.
However, FP8 and FP4 are not standard IEEE formats.
FP8 was proposed by several chip manufacturers in September 2022. The paper is at: https://arxiv.org/abs/2209.05433
FP4 was proposed by an academic group in October 2023. The paper is at: https://arxiv.org/abs/2310.16836
There are two variants of the FP8 format, E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa).
The number of sign bits, exponent bits, and fraction bits for each format is shown in the following table:
Format | Sign bits | Exponent bits | Fraction bits | Total bits |
--- | --- | --- | --- | --- |
FP64 | 1 | 11 | 52 | 64 |
FP32 | 1 | 8 | 23 | 32 |
FP16 | 1 | 5 | 10 | 16 |
FP8 E4M3 | 1 | 4 | 3 | 8 |
FP8 E5M2 | 1 | 5 | 2 | 8 |
FP4 | 1 | 2 | 1 | 4 |
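A hedged sketch of the two FP8 variants, assuming a recent PyTorch (2.1 or later) that exposes the experimental torch.float8_e4m3fn and torch.float8_e5m2 dtypes; rounding to 8 bits discards most of the fraction digits:
import torch
x = torch.tensor([3.14159, 0.0001, 300.0])
# Cast to each FP8 variant and back to FP32 to observe the rounding error
print(x.to(torch.float8_e4m3fn).to(torch.float32))  # more fraction bits, smaller range
print(x.to(torch.float8_e5m2).to(torch.float32))    # fewer fraction bits, larger range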
[Special precision]
TF32, Tensor Float 32, is a special numerical type designed by NVIDIA for machine learning as a drop-in replacement for FP32 in tensor operations. It was first supported on the A100 GPU (Ampere architecture).
It consists of 1 sign bit, 8 exponent bits (aligned with FP32), and 10 fraction bits (aligned with FP16), for a total of only 19 bits. It strikes a balance between performance, range, and precision.
Check whether TF32 is allowed in Python:
import torch
# Whether TF32 is allowed in matrix multiplications (defaults to False since PyTorch 1.12)
torch.backends.cuda.matmul.allow_tf32
# Whether TF32 is allowed in cuDNN operations (defaults to True)
torch.backends.cudnn.allow_tf32
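A minimal sketch of explicitly enabling TF32; the flags only take effect on Ampere or newer GPUs:
import torch
# Allow PyTorch to use TF32 tensor cores for FP32 matmuls and cuDNN convolutions
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True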
BF16, Brain Float 16, was proposed by Google Brain and is also designed for machine learning. It consists of 1 sign bit, 8 exponent bits (the same as FP32), and 7 fraction bits (fewer than FP16). Its precision is therefore lower than FP16, but its representable range matches FP32, which makes conversion between BF16 and FP32 easy.
On NVIDIA GPUs, BF16 is supported only on the Ampere architecture and later.
Check whether it is supported in Python:
import transformers
# Returns True if the current GPU supports BF16
transformers.utils.import_utils.is_torch_bf16_gpu_available()
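An alternative check directly through PyTorch, plus a bfloat16 tensor for comparison (a sketch; the support check assumes a CUDA build of PyTorch with a GPU available):
import torch
# True on Ampere-or-newer NVIDIA GPUs
print(torch.cuda.is_bf16_supported())
# Note how the 7 fraction bits round the value
print(torch.tensor([3.14159], dtype=torch.bfloat16))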
NF4, 4-bit NormalFloat, is a special format for quantization, proposed by the University of Washington in the QLoRA paper in May 2023. The paper is at: https://arxiv.org/abs/2305.14314
NF4 is an information-theoretically optimal data type based on quantile quantization: weights are normalized into the range [-1, 1], and each of the 16 possible 4-bit codes maps to a quantile of a zero-mean standard normal distribution, which matches the typical distribution of neural network weights. Readers familiar with the principle of quantization will recognize the idea.
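In practice NF4 is usually used through the bitsandbytes integration in Hugging Face transformers. A hedged sketch, assuming transformers and bitsandbytes are installed with a CUDA GPU; the model name is only a placeholder:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style 4-bit loading: weights stored in NF4, compute done in BF16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# "facebook/opt-350m" is just an example checkpoint
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)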
The bit counts for the FP formats and the special formats are summarized in the following table:
Format | Sign bits | Exponent bits | Fraction bits | Total bits |
--- | --- | --- | --- | --- |
FP64 | 1 | 11 | 52 | 64 |
FP32 | 1 | 8 | 23 | 32 |
TF32 | 1 | 8 | 10 | 19 |
BF16 | 1 | 8 | 7 | 16 |
FP16 | 1 | 5 | 10 | 16 |
FP8 E4M3 | 1 | 4 | 3 | 8 |
FP8 E5M2 | 1 | 5 | 2 | 8 |
FP4 | 1 | 2 | 1 | 4 |
Multi-precision and mixed-precision
Multi-precision computing means using different precision levels for different parts of a computation: double precision where high accuracy is required, and single or half precision elsewhere.
Mixed-precision computing uses different precision levels within a single operation or workflow, achieving computational efficiency without sacrificing accuracy and reducing the memory, time, and power consumption required; a minimal sketch is shown below.
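A minimal mixed-precision training sketch, assuming PyTorch with a CUDA GPU; the model and data are placeholders:
import torch

model = torch.nn.Linear(128, 10).cuda()              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                 # scales the loss to avoid FP16 gradient underflow

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.cuda.amp.autocast():                      # runs eligible ops in half precision
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()                        # backward pass on the scaled loss
scaler.step(optimizer)                               # unscales gradients, then updates weights
scaler.update()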
Quantization precision
Generally speaking, the lower the precision, the smaller the model and the less memory inference requires. To reduce resource usage as much as possible, quantization algorithms were invented: an FP32 value occupies 4 bytes, while quantizing it to 8 bits requires only 1 byte.
INT8 and INT4 are the most commonly used, and other quantization bit widths exist (6-bit, 5-bit, and even 3-bit). Although resource usage drops substantially, the inference results do not differ much.
Quantization algorithms are not covered in detail here; the sketch below only illustrates the basic idea.
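A hedged sketch of symmetric (absmax) INT8 quantization, one simple scheme among many; NumPy is assumed:
import numpy as np

def quantize_int8(x: np.ndarray):
    # Scale so that the largest absolute value maps to 127
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate FP32 values from the 1-byte codes
    return q.astype(np.float32) * scale

w = np.random.randn(4).astype(np.float32)
q, scale = quantize_int8(w)
print(w)
print(dequantize_int8(q, scale))  # close to w, but stored in 1 byte per value instead of 4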