From Float64 to INT4: the logic behind large-model precision selection and how to match it to your scenario

A practical guide to choosing numerical precision for deep learning models, covering the full range of options from high to low precision.
Core content:
1. High-precision camp: technical characteristics and application scenarios of Float64 and Float32
2. The balancers: how BFloat16 and Float16 trade precision for efficiency
3. Low-precision revolution: quantization techniques and industrial practice with INT8 and INT4
In deep learning, numerical precision is not just a technical parameter; it is the lever by which performance is traded against cost. This article analyzes the core precision formats from Float64 to INT4, combining technical principles with real-world cases to help you find the "sweet spot" for your business scenario.
1. High-precision camp: the guardian of scientific computing
Float64 (double-precision floating point number)
Technical features: 64-bit storage, providing about 15-16 significant decimal digits and a dynamic range of roughly 1e-308 to 1e+308.
Core scenarios:
- High-precision scientific computing (e.g. quantum mechanics simulation)
- Financial risk-control systems with strict numerical-stability requirements
- Computations involving very small gradients, such as during neural network weight initialization
Limitations: GPU memory usage is twice that of FP32, and compute is typically 40%-60% slower.
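To make the gap concrete, here is a minimal sketch (using NumPy, an assumed tool here, not one the article prescribes): FP64 keeps track of tiny increments that fall below FP32's resolution.

```python
import numpy as np

# Accumulate 100,000 increments of 1e-8 starting from 1.0.
# FP64 (~15-16 significant digits) tracks the sum; FP32 (~7 digits) cannot,
# because 1e-8 is below FP32's resolution around 1.0 (machine eps ~1.2e-7).
tiny = 1e-8
acc64 = np.float64(1.0)
acc32 = np.float32(1.0)
for _ in range(100_000):
    acc64 += np.float64(tiny)
    acc32 += np.float32(tiny)

print(acc64)  # ~1.001 -- close to the exact answer
print(acc32)  # 1.0    -- the increments are silently lost
```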
Float32 (single-precision floating point number)
Industry status: the de facto standard precision for deep learning training; master weights are kept in FP32 to keep gradient updates stable.
Typical applications:
- Master-weight storage during model training
- Precision-sensitive medical image segmentation tasks
- Reward-function calculation in reinforcement learning
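As a quick sanity check, a short PyTorch sketch (PyTorch is an assumption for illustration): newly constructed layers default to FP32, at 4 bytes per parameter.

```python
import torch
from torch import nn

# A freshly built layer stores its weights in FP32 by default, which is also
# where mixed-precision training keeps its master copy of the weights.
layer = nn.Linear(1024, 1024)
print(layer.weight.dtype)  # torch.float32

n_params = sum(p.numel() for p in layer.parameters())
print(f"{n_params} params, {n_params * 4 / 1e6:.1f} MB at 4 bytes each")
```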
2. The balancers: the art of compromise between efficiency and precision
BFloat16 (brain floating point number)
Technical breakthrough: a 16-bit format obtained by truncating FP32's mantissa while keeping FP32's 8-bit exponent, so essentially none of the dynamic range is lost.
Advantageous scenarios:
- Native support in the Google TPU ecosystem (accelerated matrix operations)
- Mixed-precision training in the pre-training phase of large models
- Word-vector computation in natural language processing
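The practical effect of keeping FP32's exponent width is easy to verify with PyTorch's dtype metadata (a CPU-only sketch; PyTorch is an assumption, and no TPU or GPU is required):

```python
import torch

# BF16 shares FP32's 8-bit exponent, so its representable range is almost
# identical; FP16 spends those bits on mantissa and tops out at 65504.
print(torch.finfo(torch.float32).max)   # ~3.40e38
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38 -> range essentially preserved
print(torch.finfo(torch.float16).max)   # 65504.0  -> far narrower range

x = torch.tensor(1e30)
print(x.to(torch.bfloat16))  # still representable, just with a coarser mantissa
print(x.to(torch.float16))   # overflows to inf
```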
Float16 (half-precision floating point)
Performance leap: memory usage is 50% lower than FP32, and throughput is 2-3x higher on NVIDIA Ampere-architecture GPUs.
Risk warning:
- The probability of exploding/vanishing gradients rises by roughly 30%
- Must be paired with loss scaling during training (see the sketch below)
Mature applications:
- Stable Diffusion text-to-image inference
- Acoustic models for real-time speech recognition
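In practice the loss-scaling requirement usually looks like the following minimal FP16 training step, sketched with PyTorch's automatic mixed precision (assumes a CUDA-capable GPU; the tiny model and random data are placeholders):

```python
import torch
from torch import nn

# Minimal FP16 mixed-precision step: weights stay in FP32, forward/backward run
# in FP16 under autocast, and GradScaler rescales the loss so that small
# gradients do not underflow to zero in half precision.
model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 512, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), target)

scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
scaler.update()                # adapts the scale factor for the next iteration
```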
3. Low-precision revolution: the law of survival in the mobile Internet era
INT8 (8-bit integer)
Quantization revolution: floating-point values are mapped into the integer range -128 to 127, compressing the model to roughly 1/4 of its FP32 size and raising compute throughput by up to 8x on INT8-capable CUDA hardware.
Industrial practice:
- Lightweight inference on mobile phones (e.g. MobileNetV3)
- Real-time object detection on edge devices
- Coarse-ranking models in e-commerce recommendation systems
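The core mapping is simple enough to sketch in a few lines of NumPy (a per-tensor symmetric scheme for illustration only; production toolchains add per-channel scales, calibration data and fused INT8 kernels):

```python
import numpy as np

# Symmetric per-tensor INT8 quantization: pick one scale so the largest value
# maps to 127, round into [-128, 127], then dequantize to inspect the error.
def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
print("max round-trip error:", np.abs(weights - recovered).max())  # small, not zero
```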
INT4 (4-bit integer)
Extreme optimization: the model is compressed to roughly 1/8 of its FP32 size in exchange for a representation range of only -8 to 7; a single A100 GPU can then serve a model with over 10 billion parameters.
Technical challenges:
- Accuracy loss of 10%-15% (the data distribution must be calibrated)
- Complex activation functions cannot be quantized directly
Breakthrough cases:
- Meta's LLaMA-7B INT4 quantized release
- The mobile conversation engine of Alibaba Cloud's Qwen
- Offline voice assistants on smartwatches
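A hypothetical sketch of where the 1/8 figure comes from: quantize to the 16 levels of [-8, 7] and pack two 4-bit values into each byte (NumPy illustration; real INT4 schemes typically add group-wise scales and calibrated clipping):

```python
import numpy as np

# INT4 quantization plus nibble packing: each FP32 weight (4 bytes) ends up as
# half a byte, i.e. roughly 1/8 of the original storage.
def quantize_int4(x: np.ndarray):
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_int4(q: np.ndarray) -> np.ndarray:
    nibbles = (q & 0x0F).astype(np.uint8)          # two's-complement nibbles
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

weights = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int4(weights)
packed = pack_int4(q)
print(weights.nbytes, "->", packed.nbytes, "bytes")  # 4096 -> 512, i.e. 1/8
```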
4. Precision selection decision tree (with comparison table)
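The selection logic can be condensed into a small helper function (illustrative only; the branches and labels are assumptions distilled from the recommendations in the next section, not benchmark-derived rules):

```python
# Toy precision-selection helper condensing this article's guidance.
def pick_precision(stage: str, device: str, accuracy_critical: bool = False) -> str:
    if accuracy_critical:                       # e.g. medical or financial workloads
        return "FP32 throughout (FP64 for scientific computing)"
    if stage == "training":
        return "FP32 master weights + BF16/FP16 mixed precision"
    if stage == "inference" and device == "cloud":
        return "FP16 + INT8 mixed precision"
    if stage == "inference" and device == "edge":
        return "INT8, or calibrated INT4 for the largest models"
    return "FP32 (safe default)"

print(pick_precision("training", "cloud"))   # FP32 master weights + BF16/FP16 ...
print(pick_precision("inference", "edge"))   # INT8, or calibrated INT4 ...
```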
5. Practical suggestions: precision combination strategies at different stages
Training phase: FP32 (master weights) + FP16 (intermediate computation) + BF16 (gradient aggregation)
Deployment phase:
- Cloud services: FP16 + INT8 mixed precision (switched dynamically)
- Mobile devices: INT4 quantization + heterogeneous CPU/GPU computing
Special scenarios:
- Medical diagnosis models: full FP32 computation
- Game NPC dialogue systems: hybrid INT8 + FP16 inference
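For the deployment-phase INT8 path, a minimal post-training sketch using PyTorch's dynamic quantization utility (the toy model is a placeholder; this covers only the INT8 piece of the strategies above, not FP16 switching or INT4 packing):

```python
import torch
from torch import nn

# Post-training dynamic quantization: Linear layers are rewritten so their
# weights are stored as INT8 and dequantized on the fly at inference time.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(model(x).shape, quantized(x).shape)  # same interface, lighter Linear kernels
```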