Do you have enough GPU memory to handle large models? Learn how to estimate it in one article

Master GPU memory estimation techniques to efficiently deploy large models.
Core content:
1. Analysis of the relationship between GPU memory and model deployment
2. Key factors for estimating memory requirements
3. Memory estimation methods for inference and training scenarios
In day-to-day projects there is also a pressing need to deploy large models privately, so I spent some time studying in depth how model usage relates to GPU configuration, and this article records what I found.
The size of the GPU's video memory directly determines how large a model you can run, how fast it runs (it constrains the batch size and sequence length), and whether training is stable.
So how do you estimate it? The following factors need to be considered:
1. Model parameters themselves
The most basic VRAM usage comes from the model parameters themselves. This part of the calculation is relatively straightforward:
VRAM_parameters ≈ total number of model parameters × number of bytes required for a single parameter.
FP32: 4 bytes
FP16 / BF16: 2 bytes
INT8: 1 byte
INT4: 0.5 bytes
Different quantization schemes will compress the model parameters.
Take the Llama 3 8B model, which has 8 billion parameters, as an example. If it is loaded in FP16:
8B × 2 bytes ≈ 16 GB
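As a quick sanity check, here is a minimal Python sketch of this formula (the byte sizes mirror the list above; GB here means 10^9 bytes):

```python
# Minimal sketch: VRAM taken by the model weights alone.
# Byte sizes mirror the precision list above; GB = 1e9 bytes.

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def param_vram_gb(num_params: float, dtype: str = "fp16") -> float:
    """VRAM_parameters ≈ number of parameters × bytes per parameter."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(f"Llama 3 8B in FP16: {param_vram_gb(8e9, 'fp16'):.0f} GB")  # ≈ 16 GB
```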
2. Activations & KV Cache
Activations are the intermediate results of the model's forward pass. Their size is strongly related to the batch size, sequence length, model hidden dimension, and number of layers.
When the model generates text autoregressively, the Key and Value states of all past tokens at each Transformer layer are cached to speed up computation. This part of the VRAM consumption is large and grows linearly with the sequence length and batch size.
VRAM_KV_Cache ≈ 2 × number of layers × hidden dimension × sequence length × batch size × bytes per element
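A rough Python sketch of this formula follows; note that models using grouped-query attention cache a smaller KV dimension than the full hidden size, so treat this as an upper-bound style estimate:

```python
# Rough sketch of the KV cache formula above. Models with grouped-query
# attention (GQA) cache fewer KV heads than the full hidden dimension,
# so this tends to overestimate for such models.

def kv_cache_vram_gb(num_layers: int, hidden_dim: int, seq_len: int,
                     batch_size: int, bytes_per_elem: float = 2.0) -> float:
    """VRAM_KV_Cache ≈ 2 (K and V) × layers × hidden dim × seq len × batch × bytes."""
    return 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_elem / 1e9

# Example used in the inference section below: 32 layers, hidden dim 4096,
# sequence length 2048, batch size 4, FP16 (2 bytes)
print(f"{kv_cache_vram_gb(32, 4096, 2048, 4):.1f} GB")  # ≈ 4.3 GB
```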
In model training or SFT scenarios, there are two additional major memory consumers to consider.
3. Gradients
The first is gradients.
During back-propagation, the system needs to calculate the gradient value for each trainable parameter in order to update the model weights.
VRAM_gradient ≈ number of trainable parameters × bytes corresponding to the training precision
Usually the gradient precision matches the precision of the model parameters during training. For example, if FP16 is used for training, the gradients also occupy FP16 space.
4. Optimizer States
The second is the optimizer state, which is a big memory consumer during training. Optimizers (such as Adam, AdamW) need to maintain state information (such as momentum, variance) for each trainable parameter.
More importantly, these state values are often stored in FP32 (4-byte) precision, even if the main model is trained using FP16 or BF16. AdamW often requires 2 × 4 = 8 bytes of additional storage for each trainable parameter.
For a fully fine-tuned 8B model, this alone may require:
8B × 8 bytes = 64 GB
Using an 8-bit optimizer can significantly reduce this.
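A minimal sketch of these two training-only consumers (gradients at the training precision, plus AdamW's two FP32 states per parameter), using the 8B example above:

```python
# Sketch of the two training-only consumers: gradients stored at the
# training precision, plus AdamW's two FP32 states per trainable parameter.

def gradient_vram_gb(trainable_params: float, bytes_per_grad: float = 2.0) -> float:
    """VRAM_gradient ≈ trainable parameters × bytes at training precision."""
    return trainable_params * bytes_per_grad / 1e9

def adamw_state_vram_gb(trainable_params: float, bytes_per_state: float = 4.0) -> float:
    """AdamW keeps 2 states (momentum, variance) per parameter, usually in FP32."""
    return trainable_params * 2 * bytes_per_state / 1e9

# Full fine-tuning of an 8B model in FP16 with standard AdamW:
print(f"gradients: {gradient_vram_gb(8e9):.0f} GB")    # ≈ 16 GB
print(f"optimizer: {adamw_state_vram_gb(8e9):.0f} GB")  # ≈ 64 GB

# An 8-bit optimizer would shrink the states to roughly 2 × 1 byte per
# parameter, i.e. about 16 GB for the same model.
```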
GPU memory estimation in inference/training scenarios
1. Inference
Total inference VRAM ≈ VRAM_parameters + VRAM_activations + VRAM_kv_cache + VRAM_overhead
Take a Llama 3 8B (FP16) inference as an example:
Model parameters:
8B parameters * 2 bytes/parameter = 16 GB
Activations and KV Cache: highly dependent on sequence length and batch size. For batch size 4 and sequence length 2048, assuming Hidden Dim = 4096 and Num Layers = 32, the KV Cache (FP16) is:
2 × 32 × 4096 × 2048 × 4 × 2 bytes ≈ 4.3 GB
Overhead: framework, CUDA kernels, estimated 1-2 GB
Estimated total: 16 + 4.3 + (1 to 2) ≈ 21-22 GB
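Putting the pieces together, a small sketch of the inference estimate for this example (the 1.5 GB overhead is an assumed midpoint of the 1-2 GB range):

```python
# Inference estimate for Llama 3 8B in FP16, batch size 4, sequence length 2048.
# The overhead value is an assumed midpoint of the 1-2 GB range above.

params_gb = 8e9 * 2 / 1e9                         # weights in FP16   ≈ 16.0 GB
kv_cache_gb = 2 * 32 * 4096 * 2048 * 4 * 2 / 1e9  # KV cache in FP16  ≈ 4.3 GB
overhead_gb = 1.5                                 # framework / CUDA kernels

print(f"estimated inference VRAM ≈ {params_gb + kv_cache_gb + overhead_gb:.1f} GB")  # ≈ 21.8 GB
```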
2. Training
Full fine-tuning
VRAM ≈ VRAM_params + VRAM_gradients + VRAM_optimizer + VRAM_activations + VRAM_overhead
Llama 3 8B (FP16), AdamW (FP32 state)
1. Model parameters (FP16): 8B parameters * 2 bytes/parameter = 16 GB
2. Gradients (FP16): 8B parameters * 2 bytes/parameter = 16 GB
3. Optimizer states (AdamW, FP32): 2 states/parameter * 8B parameters * 4 bytes/state = 64 GB
4. Activations: highly dependent on batch size and sequence length; roughly 10-30 GB or more (a very rough approximation)
5. Additional overhead: estimated 1-2 GB
6. Estimated total: 16 + 16 + 64 + (10 to 30) + (1 to 2) ≈ 107-128 GB
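A rough end-to-end sketch of this full fine-tuning estimate (the 20 GB activation figure is an assumed midpoint of the 10-30 GB range):

```python
# Rough end-to-end estimate for full fine-tuning of Llama 3 8B in FP16 with
# standard AdamW, following the breakdown above. The 20 GB activation figure
# is an assumed midpoint of the 10-30 GB range.

def full_finetune_vram_gb(num_params: float,
                          activations_gb: float = 20.0,
                          overhead_gb: float = 1.5) -> float:
    params_gb = num_params * 2 / 1e9         # FP16 weights
    grads_gb = num_params * 2 / 1e9          # FP16 gradients
    optimizer_gb = num_params * 2 * 4 / 1e9  # AdamW: 2 FP32 states per parameter
    return params_gb + grads_gb + optimizer_gb + activations_gb + overhead_gb

print(f"≈ {full_finetune_vram_gb(8e9):.0f} GB")  # ≈ 118 GB, within the 107-128 GB range
```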
PEFT fine-tuning
Fine-tuning using techniques such as LoRA can significantly reduce VRAM requirements by freezing the base model parameters and training only small adapter layers.
Llama 3 8B with LoRA (Rank=8, Alpha=16)
16 GB (Base) + ~0.24 GB (LoRA Params/Grads/Optim) + (10 to 30) GB (Activations) + (1 to 2) GB (Overhead) ≈ 27 - 48 GB
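The ~0.24 GB figure can be reproduced roughly with the sketch below. The settings are my assumptions, not stated above: rank-8 adapters on every linear projection of a Llama-3-8B-like architecture (32 layers, hidden size 4096, KV dimension 1024, MLP intermediate size 14336), FP16 adapter weights and gradients, and FP32 AdamW states:

```python
# Reproducing the LoRA footprint under assumed settings: rank-8 adapters on
# every linear projection of a Llama-3-8B-like architecture. The architecture
# numbers and target modules below are assumptions for illustration.

RANK = 8
LAYERS = 32
HIDDEN, KV_DIM, FFN = 4096, 1024, 14336

# Each adapted weight matrix of shape (d_out, d_in) adds r * (d_in + d_out) parameters.
per_layer = (
    RANK * (HIDDEN + HIDDEN)      # q_proj
    + RANK * (HIDDEN + KV_DIM)    # k_proj
    + RANK * (HIDDEN + KV_DIM)    # v_proj
    + RANK * (HIDDEN + HIDDEN)    # o_proj
    + 3 * RANK * (HIDDEN + FFN)   # gate_proj / up_proj / down_proj
)
lora_params = per_layer * LAYERS  # ≈ 21M trainable parameters

# 2 B weights + 2 B gradients + 8 B AdamW states per trainable parameter
lora_gb = lora_params * (2 + 2 + 8) / 1e9
print(f"LoRA params/grads/optimizer ≈ {lora_gb:.2f} GB")  # ≈ 0.25 GB
```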
GPU memory calculator
There is also an online calculator app (built by an overseas developer) for estimating GPU memory requirements; you can give it a try.