How many kinds of precision do large models involve? FP32, TF32, FP16, BF16, FP8, FP4, NF4, and INT8 explained in one article

An in-depth look at precision in large model training, from FP32 to INT8; this article gives you a comprehensive overview.
Core content:
1. Classification and definition of floating point precision and quantization precision
2. The impact of different precisions on computing cost and accuracy
3. Selection and usage scenarios of various precisions in practical applications
Training and inference of large models constantly involve the concept of precision, and there are many kinds of it. Moreover, at the same bit width there can be different formats. I haven't found an article online that covers all of them in one place, so here is a comprehensive summary.
Overall Introduction
Floating point precision: double precision (FP64), single precision (FP32, TF32), half precision (FP16, BF16), 8-bit precision (FP8), 4-bit precision (FP4, NF4)
Quantization precision: INT8, INT4 (and also INT3/INT5/INT6)
In addition, in actual usage scenarios, there are also the concepts of multi-precision and mixed precision.
What is precision?
Suppose you earn 1 yuan per second; your monthly income is 1 × 60 × 60 × 24 × 30 = 2,592,000 yuan. If you instead earn 1.1 yuan per second, your monthly income is 2,851,200 yuan. A difference of just 0.1 yuan per second changes your monthly income by more than 250,000 yuan. This is the difference that precision makes.
Another typical example is π, which is often expressed as 3.14, but if higher precision is required, there can be an infinite number of digits after the decimal point.
Of course, these are precision in the mathematical sense. In a computer, the precision of a floating-point number is determined by how it is stored: the more bits it occupies, the higher the precision.
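A minimal illustration (assuming NumPy is installed): the same value π stored at three different precisions, where fewer fraction bits mean fewer significant digits are preserved:
import numpy as np
# π stored with 52, 23, and 10 fraction bits respectively
print(np.float64(np.pi))  # ≈ 3.141592653589793
print(np.float32(np.pi))  # ≈ 3.1415927
print(np.float16(np.pi))  # ≈ 3.14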
Why so much precision?
Because of cost and accuracy.
We all know that high precision is definitely more accurate, but it also brings higher computing and storage costs. Lower precision will reduce computing accuracy, but can improve computing efficiency and performance. Therefore, a variety of different precisions allow you to choose the most suitable one in different situations.
Double precision is more accurate than single precision, but it occupies twice the storage and is slower to compute. If single precision is enough, there is no need for double precision.
Different floating point precisions
In computers, floating-point numbers are stored in three parts: sign, exponent, and fraction. The sign always occupies 1 bit, the exponent bits determine the representable range, and the fraction bits determine the precision.
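A small sketch (standard library only) that prints the three parts of an FP32 number, to make the layout concrete:
import struct

def fp32_bits(x: float) -> str:
    # Pack the value as a 32-bit big-endian float, then format its raw bits
    bits = format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')
    return f"sign={bits[0]} exponent={bits[1:9]} fraction={bits[9:]}"

print(fp32_bits(3.14))  # 1 sign bit, 8 exponent bits, 23 fraction bits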
[FP precision]
FP (Floating Point) refers to the standard floating-point formats defined by IEEE 754. Each consists of three parts: a sign, an exponent, and a fraction.
FP64 is a 64-bit floating point number consisting of 1 sign bit, 11 exponent bits, and 52 fraction bits.
FP32, FP16, FP8, and FP4 are composed in the same way, differing only in the number of exponent bits and fraction bits.
However, FP8 and FP4 are not standard IEEE formats.
FP8 was proposed by several chip manufacturers in September 2022. The paper is at: https://arxiv.org/abs/2209.05433
FP4 was proposed by an academic group in October 2023. The paper is at: https://arxiv.org/abs/2310.16836
There are two variants of the FP8 format, E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa).
The number of sign bits, exponent bits, and fraction bits for each format is shown in the following table:
Format | Sign bits | Exponent bits | Fraction bits | Total bits |
--- | --- | --- | --- | --- |
FP64 | 1 | 11 | 52 | 64 |
FP32 | 1 | 8 | 23 | 32 |
FP16 | 1 | 5 | 10 | 16 |
FP8 E4M3 | 1 | 4 | 3 | 8 |
FP8 E5M2 | 1 | 5 | 2 | 8 |
FP4 | 1 | 2 | 1 | 4 |
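A hedged sketch of the two FP8 variants, assuming a recent PyTorch (2.1 or later) that exposes the experimental torch.float8_e4m3fn and torch.float8_e5m2 dtypes; rounding to 8 bits discards most of the fraction digits:
import torch
x = torch.tensor([3.14159, 0.0001, 300.0])
# Cast to each FP8 variant and back to FP32 to observe the rounding error
print(x.to(torch.float8_e4m3fn).to(torch.float32))  # more fraction bits, smaller range
print(x.to(torch.float8_e5m2).to(torch.float32))    # fewer fraction bits, larger range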
[Special precision]
TF32, Tensor Float 32, is a special numerical type designed by NVIDIA for machine learning as a drop-in replacement for FP32 in tensor operations. It was first supported on the A100 GPU (Ampere architecture).
It consists of 1 sign bit, 8 exponent bits (aligned with FP32), and 10 fraction bits (aligned with FP16), for a total of only 19 bits. It strikes a balance between performance, range, and precision.
Check whether TF32 is allowed in Python:
import torch
# Whether TF32 is allowed in matrix multiplications (defaults to False since PyTorch 1.12)
torch.backends.cuda.matmul.allow_tf32
# Whether TF32 is allowed in cuDNN operations (defaults to True)
torch.backends.cudnn.allow_tf32
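A minimal sketch of explicitly enabling TF32; the flags only take effect on Ampere or newer GPUs:
import torch
# Allow PyTorch to use TF32 tensor cores for FP32 matmuls and cuDNN convolutions
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True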
BF16, Brain Float 16, was proposed by Google Brain and is also designed for machine learning. It consists of 1 sign bit, 8 exponent bits (the same as FP32), and 7 fraction bits (fewer than FP16). Its precision is therefore lower than FP16, but its representable range matches FP32, which makes conversion between BF16 and FP32 easy.
On NVIDIA GPUs, BF16 is supported only on the Ampere architecture and later.
Check whether it is supported in Python:
import transformers
# Returns True if the current GPU supports BF16
transformers.utils.import_utils.is_torch_bf16_gpu_available()
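An alternative check directly through PyTorch, plus a bfloat16 tensor for comparison (a sketch; the support check assumes a CUDA build of PyTorch with a GPU available):
import torch
# True on Ampere-or-newer NVIDIA GPUs
print(torch.cuda.is_bf16_supported())
# Note how the 7 fraction bits round the value
print(torch.tensor([3.14159], dtype=torch.bfloat16))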
NF4, 4-bit NormalFloat, is a special format for quantization, proposed by the University of Washington in the QLoRA paper in May 2023. The paper is at: https://arxiv.org/abs/2305.14314
NF4 is an information-theoretically optimal data type based on quantile quantization: weights are normalized into the range [-1, 1], and each of the 16 possible 4-bit codes maps to a quantile of a zero-mean standard normal distribution, which matches the typical distribution of neural network weights. Readers familiar with the principle of quantization will recognize the idea.
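In practice NF4 is usually used through the bitsandbytes integration in Hugging Face transformers. A hedged sketch, assuming transformers and bitsandbytes are installed with a CUDA GPU; the model name is only a placeholder:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style 4-bit loading: weights stored in NF4, compute done in BF16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# "facebook/opt-350m" is just an example checkpoint
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)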
The bit counts for the FP formats and the special formats are summarized in the following table:
Format | Sign bits | Exponent bits | Fraction bits | Total bits |
--- | --- | --- | --- | --- |
FP64 | 1 | 11 | 52 | 64 |
FP32 | 1 | 8 | 23 | 32 |
TF32 | 1 | 8 | 10 | 19 |
BF16 | 1 | 8 | 7 | 16 |
FP16 | 1 | 5 | 10 | 16 |
FP8 E4M3 | 1 | 4 | 3 | 8 |
FP8 E5M2 | 1 | 5 | 2 | 8 |
FP4 | 1 | 2 | 1 | 4 |
Multi-precision and mixed-precision
Multi-precision computing means using different precision levels for different parts of a computation: double precision where high accuracy is required, and single or half precision elsewhere.
Mixed-precision computing uses different precision levels within a single operation or workflow, achieving computational efficiency without sacrificing accuracy and reducing the memory, time, and power consumption required; a minimal sketch is shown below.
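A minimal mixed-precision training sketch, assuming PyTorch with a CUDA GPU; the model and data are placeholders:
import torch

model = torch.nn.Linear(128, 10).cuda()              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                 # scales the loss to avoid FP16 gradient underflow

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.cuda.amp.autocast():                      # runs eligible ops in half precision
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()                        # backward pass on the scaled loss
scaler.step(optimizer)                               # unscales gradients, then updates weights
scaler.update()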
Quantization precision
Generally speaking, the lower the precision, the smaller the model and the less memory inference requires. To reduce resource usage as much as possible, quantization algorithms were invented: an FP32 value occupies 4 bytes, while quantizing it to 8 bits requires only 1 byte.
INT8 and INT4 are the most commonly used, and other quantization bit widths exist (6-bit, 5-bit, and even 3-bit). Although resource usage drops substantially, the inference results do not differ much.
Quantization algorithms are not covered in detail here; the sketch below only illustrates the basic idea.
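A hedged sketch of symmetric (absmax) INT8 quantization, one simple scheme among many; NumPy is assumed:
import numpy as np

def quantize_int8(x: np.ndarray):
    # Scale so that the largest absolute value maps to 127
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate FP32 values from the 1-byte codes
    return q.astype(np.float32) * scale

w = np.random.randn(4).astype(np.float32)
q, scale = quantize_int8(w)
print(w)
print(dequantize_int8(q, scale))  # close to w, but stored in 1 byte per value instead of 4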