How much video memory is needed for private deployment of large models?

Written by
Caleb Hayes
Updated on: June 22, 2025
Recommendation

Gain a solid understanding of the memory requirements of large AI model deployment and avoid wasting enormous sums on hardware.

Core content:
1. The two core components of large-model memory requirements: static parameter memory and the dynamic computation cache
2. Calculation formulas for parameter memory and the KV cache, with worked examples
3. A hidden cost: the impact of intermediate activation values on memory, and how to reduce it

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

 

Recently I have been involved in quite a few AI projects.

The first thing customers ask is: how much computing power does it take to deploy a large model?

"Boss, how much video memory does a 32B model need?" "Is 90GB enough?" "No, you may have underestimated by 30%!"

As billion-parameter-scale models become the industry standard, an error in video memory estimation can easily translate into hardware cost deviations of hundreds of thousands.

Today, I will try to explain computing-power estimation in plain terms, using the video memory needed to deploy the QwQ-32B model at FP16 precision as an example. (QwQ-32B is a large autoregressive language model developed by Alibaba's Qwen team on the Qwen2.5 architecture.)

PS: On the day this article was published, Qwen3 was released, which greatly reduces the hardware requirements. A follow-up article will introduce it.

Core components of video memory usage

The memory usage of a large model is made up of two main parts: static parameter memory and the dynamic computation cache. Together they determine the baseline for hardware selection.

1. Parameter memory: static memory of the model

Calculation formula:

Parameter count × bytes per parameter = parameter memory (in bytes; divide by 10⁹ to express it in GB)

Take QwQ-32B as an example: parameter count 32B (32 × 10⁹, i.e., 32 billion); precision FP16 (2 bytes per parameter). Per the formula: 32 × 10⁹ × 2 bytes = 64 × 10⁹ bytes ≈ 64 GB
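As a minimal sketch of the same arithmetic (plain Python written for this article; the dtype table and the 1 GB = 10⁹ bytes convention are assumptions chosen to match the estimate above):

```python
# Minimal sketch: static parameter memory = parameter count x bytes per parameter.
# Assumes 1 GB = 1e9 bytes, matching the back-of-the-envelope math above.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def param_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Static parameter memory in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# QwQ-32B at FP16: 32e9 params x 2 bytes = 64e9 bytes ~ 64 GB
print(param_memory_gb(32e9, "fp16"))  # 64.0
```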



Calculation traps:

The figure above is for the common inference scenario (AI chat). A training scenario additionally needs to store gradients, which at least doubles the video memory to 128GB.

In addition, 1B = 10⁹ (the international unit), not the Chinese unit 亿 (10⁸).

PS: In training, gradient computation is enabled; it is what drives the model parameters toward lower prediction error and is a key step of training. In this scenario the video memory requirement at least doubles.

2. KV Cache: the memory black hole of dynamic inference

To speed up inference, a KV cache is used; it is a standard feature of Transformer models, and the calculations below are estimates for a typical Transformer. When the model processes long sequences, it must store a key vector and a value vector for every token.

Calculation formula:

Number of layers × 2 (K and V) × number of heads × head dimension × context length × bytes per element = KV cache memory (bytes)

Take QwQ-32B as an example:

  • Number of layers (L): 40
  • Number of attention heads (h): 64
  • Dimension per head (d_head): 128
  • Context length (S): 16,000 tokens (16k)
  • Bytes per element: 2 (FP16)

Per the formula: 40 layers × 2 × 64 heads × 128 dimensions × 16,000 tokens × 2 bytes ≈ 20.97 × 10⁹ bytes ≈ 19.54 GB (taking 1 GB as 1024³ bytes)
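Here is a minimal sketch of the same calculation (my own Python, not from the original article; batch size 1 is assumed, and the formula assumes full multi-head attention, so grouped-query attention or MLA would shrink the result):

```python
def kv_cache_gb(layers: int, heads: int, head_dim: int, seq_len: int,
                bytes_per_elem: int = 2, batch_size: int = 1) -> float:
    """KV cache bytes = batch x layers x 2 (K and V) x heads x head_dim x tokens x bytes,
    converted with 1 GB = 1024**3 bytes."""
    total_bytes = batch_size * layers * 2 * heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 1024**3

# QwQ-32B with the figures above: 40 layers, 64 heads, 128-dim heads, 16k context, FP16
print(round(kv_cache_gb(layers=40, heads=64, head_dim=128, seq_len=16_000), 2))
# ~19.53, which the article rounds to 19.54 GB
```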

It is worth mentioning that DeepSeek's pioneering MLA (Multi-head Latent Attention) technique cuts the KV cache by about 93%, which noticeably lowers the demand for high-end GPUs in both training and inference.

Hidden costs that cannot be ignored

1. Intermediate activations: transient memory of computational processes

Intermediate activation values are temporary data produced by each layer during the forward pass of the neural network. Like the semi-finished dishes at each step of cooking, they must be kept around temporarily for the subsequent steps.

  • Training scenario: activation memory can reach 5-7 times the parameter count (for 32B, roughly 32 × 5 to 32 × 7, i.e., about 160-224GB)
  • Inference scenario: operator fusion (such as FlashAttention) significantly reduces activation usage, but it may still take 5-10GB of video memory

2. Framework overhead: system "management fee"

Frameworks such as PyTorch need an extra 10-20% of video memory for memory management, much like a warehouse must reserve aisle space for people and goods to move in and out.

3. Security redundancy: the last line of defense for system stability

To ensure stable operation, it is recommended to reserve a buffer of at least 20%. For a normal inference scenario, the total video memory is therefore:

Total video memory = (parameter video memory + KV cache) × 1.2

(64+19.54)×1.2 ≈ 100GB

It is safer to provision 128GB of total video memory (mainstream cards start at about 24GB each, so this generally means combining several cards).
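Putting the pieces together, a minimal sketch of the whole estimate (my own Python; it assumes activations and framework overhead fit inside the 20% buffer, as the article does for inference):

```python
def total_inference_vram_gb(param_gb: float, kv_cache_gb: float,
                            buffer: float = 0.2) -> float:
    """Total = (parameter memory + KV cache) x (1 + safety buffer).
    Activations and framework overhead are assumed to fit inside the buffer."""
    return (param_gb + kv_cache_gb) * (1 + buffer)

print(round(total_inference_vram_gb(64, 19.54), 1))
# ~100.2 GB -> provision 128 GB of total GPU memory in practice
```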

Recommended hardware configuration

Scenario-based Selection Guide

| Scenario | Parameter memory | KV cache | Reserved space | Recommended configuration | Hardware example |
| --- | --- | --- | --- | --- | --- |
| Training | 128GB | - | ≥30GB | ≥256GB | NVIDIA H100 80GB ×4 |
| Batch inference | 64GB | 19.54GB | 15GB | ≥128GB | NVIDIA A100 80GB ×2 |
| Single-card inference | 64GB | 19.54GB | 6GB | ≥90GB | NVIDIA H100 80GB |

The reserve requirement in the training scenario is large, since intermediate activation memory can reach 5-7 times the parameter count (see the calculation above).

Hardware video memory reference:

  • A100 80GB: can just handle single-card inference (video memory optimizations must be enabled)
  • A100 40GB: requires model parallelism (2 or more cards)
  • H100 80GB: recommended for forward-looking deployments
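As a rough sanity check on card counts (a naive sketch of my own; it ignores parallelism and communication overhead, and the single-card A100 80GB option above relies on memory optimizations to fit):

```python
import math

def cards_needed(total_vram_gb: float, per_card_gb: float) -> int:
    """Naive card count: ceil(total / per-card memory)."""
    return math.ceil(total_vram_gb / per_card_gb)

# For the ~100 GB inference estimate above:
for name, size in [("A100 40GB", 40), ("A100 80GB", 80), ("H100 80GB", 80)]:
    print(name, cards_needed(100, size))  # 40GB -> 3 cards, 80GB -> 2 cards
```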


Of course, you can also choose domestic cards, such as the Muxixiyun C series, Hygon DCU, or the Ascend series.

Summary

When asked again: "How much video memory is needed for a privately deployed 32B model?"

You can confidently reply: "About 90GB covers basic inference needs, and 128GB is recommended for safety and stability!"