From Float64 to INT4: The logic behind large-model precision selection and how to match it to your scenario

Written by Jasper Cole
Updated on: June 24, 2025
Recommendation

A practical guide to choosing numerical precision for deep learning models, covering the full range of options from high precision to low precision.

Core content:
1. The high-precision camp: technical characteristics and application scenarios of Float64 and Float32
2. The balancing camp: how BFloat16 and Float16 trade efficiency against precision
3. The low-precision revolution: quantization techniques and industrial practice for INT8 and INT4

Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Expert (TVP)

In deep learning, numerical precision is not just a technical parameter; it is a lever for trading performance against cost. This article analyzes six core precision formats, from Float64 down to INT4, through their technical principles and real-world cases, to help you find the sweet spot that best fits your business scenario.


1. High-precision camp: the guardian of scientific computing

Float64 (double-precision floating point number)

  • Technical features: uses 64-bit binary storage, providing about 15-16 significant decimal digits and a dynamic range of roughly 1e-308 to 1e+308.
  • Core scenarios :
    • High-precision scientific computing (such as quantum mechanics simulation)
    • Numerical stability requirements in financial risk control systems
    • Small gradient calculations during neural network weight initialization
  • Limitations: GPU memory usage is twice that of FP32, and compute speed is typically 40%-60% lower (the short NumPy sketch below illustrates the precision and memory gap).
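
A minimal NumPy sketch (illustrative only; the array sizes are arbitrary) makes the precision and memory gap concrete:

```python
import numpy as np

# FP64 keeps ~15-16 significant decimal digits; FP32 keeps only ~7.
a = np.float64(1.0) + np.float64(1e-12)
b = np.float32(1.0) + np.float32(1e-12)
print(a - 1.0)                  # ~1e-12: the tiny increment survives in FP64
print(b - np.float32(1.0))      # 0.0: the increment is lost in FP32

# Memory footprint: an FP64 array takes twice the bytes of an FP32 array.
print(np.zeros(1_000_000, dtype=np.float64).nbytes)  # 8000000
print(np.zeros(1_000_000, dtype=np.float32).nbytes)  # 4000000
```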

Float32 (single-precision floating point number)

  • Industry status: the de facto standard precision for deep learning training; master weights are kept in FP32 during mixed-precision training to ensure stable gradient updates (a quick PyTorch check of this default follows the list below).
  • Typical applications :
    • Main weight storage during model training
    • Precision-sensitive medical image segmentation tasks
    • Reward function computation in reinforcement learning
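
As a quick sanity check of that default (a tiny PyTorch sketch, assuming a standard install):

```python
import torch

# PyTorch tensors and module parameters default to FP32 (single precision).
print(torch.get_default_dtype())   # torch.float32
layer = torch.nn.Linear(128, 64)
print(layer.weight.dtype)          # torch.float32
print(torch.randn(4).dtype)        # torch.float32
```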

2. The balancing camp: the art of compromise between efficiency and precision

BFloat16 (brain floating point number)

  • Technical breakthrough: reaches 16-bit storage by truncating the FP32 mantissa while keeping FP32's 8-bit exponent, so the dynamic range is essentially unchanged; the trade-off is coarser precision of roughly 2-3 significant decimal digits (see the torch.finfo comparison after this list).
  • Advantageous scenarios :
    • Native support for Google TPU ecosystem (accelerated matrix operations)
    • Mixed precision training in the pre-training phase of large models
    • Word vector calculation in natural language processing
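
A small PyTorch sketch (assuming a standard install) shows this trade-off by comparing the numeric limits of the three dtypes:

```python
import torch

# BF16 keeps FP32's 8-bit exponent, so its representable max is close to
# FP32's; FP16's 5-bit exponent gives a far narrower range but a finer eps.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>16}  max={info.max:.3e}  eps={info.eps:.3e}")
```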

Float16 (half-precision floating point)

  • Performance leap: compared with FP32, memory usage drops by 50% and compute speed rises 2-3x on NVIDIA Ampere-architecture GPUs.
  • Risk Warning :
    • The narrow 5-bit exponent makes exploding/vanishing gradients noticeably more likely (roughly a 30% higher incidence)
    • Must be paired with loss scaling (see the mixed-precision sketch after this list)
  • Mature applications :
    • Stable Diffusion text-to-image model inference
    • Acoustic models for real-time speech recognition
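
The following is a minimal sketch of FP16 mixed-precision training with loss scaling in PyTorch; the model, data, and hyperparameters are placeholders, and a CUDA GPU is assumed:

```python
import torch

# Placeholder model and batch; a CUDA device is assumed.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 512, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

# Run the forward pass in FP16 where safe; keep numerically sensitive ops in FP32.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.cross_entropy(model(x), target)

scaler.scale(loss).backward()   # scale the loss so small FP16 gradients don't underflow
scaler.step(optimizer)          # unscales gradients; skips the step if inf/nan appears
scaler.update()
```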

3. Low-precision revolution: the law of survival in the mobile Internet era

INT8 (8-bit integer)

  • Quantization revolution: maps floating-point values onto the integer range -128 to 127, shrinking model size to 1/4 of FP32 and raising integer compute throughput on CUDA hardware by up to 8x (a minimal quantize/dequantize sketch follows this list).
  • Industrial practice :
    • Lightweight inference on mobile phones (e.g., MobileNetV3)
    • Real-time object detection on edge devices
    • Pre-ranking (coarse ranking) models in e-commerce recommendation systems
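
A minimal sketch of the mapping described above, using per-tensor symmetric quantization in NumPy (illustrative only; production toolchains add calibration and per-channel scales):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # One scale per tensor: map the largest magnitude to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("bytes:", w.nbytes, "->", q.nbytes)                     # 4x smaller
print("max abs error:", np.abs(w - dequantize_int8(q, scale)).max())
```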

INT4 (4-bit integer)

  • Extreme optimization: model size shrinks to 1/8 of FP32 in exchange for a representation range of only -8 to 7; a single A100 GPU can then run models with over 10 billion parameters.
  • Technical Challenges :
    • Accuracy loss of 10%-15% (the data distribution must be calibrated, typically with per-group scales; see the sketch after this list)
    • Activations with complex distributions cannot be quantized directly, so INT4 is usually applied to weights only
  • Breakthrough Case :
    • The INT4-quantized version of Meta's LLaMA-7B
    • Alibaba Cloud Qwen's mobile conversation engine
    • Offline voice assistant on smartwatch
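
The sketch below illustrates the per-group idea behind weight-only INT4 schemes such as GPTQ/AWQ; the function names and group size are invented for illustration, and the bit-packing step that real kernels use is omitted:

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 64):
    # Split the weights into groups and give each group its own scale.
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # INT4 range: -8..7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(128, 128).astype(np.float32)
q, scales = quantize_int4(w)
print("mean abs error:", np.abs(w - dequantize_int4(q, scales, w.shape)).mean())
```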

4. Precision selection decision tree (with comparison table)

| Precision type | Bit width | Representation range | Memory savings (vs. FP32) | Typical application scenarios | Precision loss |
| --- | --- | --- | --- | --- | --- |
| FP64 | 64 | ≈ ±1.8e308 (smallest subnormal ≈ 5e-324) | -100% (uses 2x) | Scientific computing / financial risk control | None |
| FP32 | 32 | ≈ ±3.4e38 (smallest normal ≈ 1e-38) | 0% (baseline) | Model training / core inference | None |
| BF16 | 16 | ≈ ±3.4e38 (same exponent as FP32) | 50% | TPU acceleration / mixed-precision training | Low |
| FP16 | 16 | ≈ ±6.5e4 (smallest subnormal ≈ 6e-8) | 50% | Image generation / speech recognition | Medium |
| INT8 | 8 | -128 to 127 | 75% | Edge device deployment | Medium-high |
| INT4 | 4 | -8 to 7 | 87.5% | Ultra-low-latency mobile scenarios | High |
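
To make the decision tree explicit, here is a hypothetical helper that encodes the table's recommendations as simple lookup rules; the scenario labels are invented for this sketch, not an established API:

```python
def choose_precision(scenario: str) -> str:
    """Map a deployment scenario to a suggested precision, per the table above.
    Purely illustrative -- the scenario labels are made up for this sketch."""
    rules = {
        "scientific_computing": "FP64",
        "financial_risk": "FP64",
        "model_training": "FP32 (master weights) + BF16/FP16 compute",
        "tpu_pretraining": "BF16",
        "image_generation": "FP16",
        "speech_recognition": "FP16",
        "edge_deployment": "INT8",
        "mobile_low_latency": "INT4",
    }
    return rules.get(scenario, "FP32")  # fall back to the safe baseline

print(choose_precision("edge_deployment"))   # INT8
```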

5. Practical suggestions: precision combination strategies at different stages

  1. Training phase: FP32 (master weights) + FP16 (forward/backward compute) + BF16 (gradient aggregation)
  2. Deployment phase (a dynamic INT8 quantization sketch follows this list):
    • Cloud services: FP16 + INT8 mixed precision (switched dynamically)
    • Mobile: INT4 quantization + CPU/GPU heterogeneous computing
  3. Special scenarios:
    • Medical diagnosis models: full FP32 computation
    • Game NPC dialogue systems: INT8 + FP16 hybrid inference
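
For the deployment cases above, one possible starting point is PyTorch's post-training dynamic INT8 quantization of Linear layers; the model below is a placeholder, and this is just one of several quantization paths:

```python
import torch

# Placeholder stand-in model; real deployments would load a trained network.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 10),
).eval()

# Quantize only the Linear layers to INT8 for CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized)   # Linear layers are replaced by DynamicQuantizedLinear
```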