Do you have enough GPU memory to handle large models? Learn how to estimate it in one article

Written by
Jasper Cole
Updated on: June 21, 2025
Recommendation

Master GPU memory estimation techniques to efficiently deploy large models.

Core content:
1. Analysis of the relationship between GPU memory and model deployment
2. Key factors for estimating memory requirements
3. Memory estimation methods for inference and training scenarios


In my day-to-day projects there is a pressing need to deploy large models privately, so I spent some time studying the relationship between model usage and GPU configuration in depth, and this article is my record of it.

The size of the GPU's memory (VRAM) directly determines how large a model you can load, how fast it runs (it constrains batch size and sequence length), and whether training remains stable.

So how do you estimate it? The following factors need to be considered:

1. Model parameters themselves

The most basic VRAM usage comes from the model parameters themselves. This part of the calculation is relatively straightforward:

VRAM_parameters ≈ total number of model parameters × number of bytes required for a single parameter.

FP32: 4 bytes 

FP16 / BF16: 2 bytes 

INT8: 1 byte 

INT4: 0.5 bytes

Different quantization schemes will compress the model parameters.

Take the Llama 3 8B model, which has 8 billion parameters, as an example. If it is loaded in FP16:

8B × 2 bytes ≈ 16 GB
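As a quick sanity check, here is a minimal Python sketch of this calculation (the function name and the assumption that 1 GB = 10^9 bytes are mine, not from any particular library):

```python
# Hypothetical helper: estimate the VRAM taken by the model weights alone.
# Byte widths follow the table above; 1 GB is taken as 1e9 bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def param_vram_gb(num_params: float, dtype: str = "fp16") -> float:
    """Approximate GB needed just to hold the parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(f"{param_vram_gb(8e9, 'fp16'):.1f} GB")  # Llama 3 8B in FP16 -> ~16.0 GB
```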

2. Activations & KV Cache

These are the intermediate results of the model's forward pass. Their size is strongly related to the batch size, sequence length, hidden dimension, and number of layers.

When the model generates text autoregressively, the past Key and Value states of each Transformer layer are cached to speed up computation. This part of the VRAM consumption is large and grows linearly with sequence length and batch size.

VRAM_KV_Cache ≈ 2 × number of layers × hidden dimension × sequence length × batch size × bytes per value
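The formula can be turned into a small estimator. This is a rough sketch that ignores grouped-query attention and other optimizations, which shrink the real KV cache considerably; the function name is illustrative:

```python
def kv_cache_vram_gb(num_layers: int, hidden_dim: int, seq_len: int,
                     batch_size: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in GB: the factor of 2 covers the Key and Value tensors per layer."""
    total_bytes = 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_value
    return total_bytes / 1e9

# Figures used in the inference example below: 32 layers, hidden dim 4096,
# sequence length 2048, batch size 4, FP16 values -> about 4.3 GB
print(f"{kv_cache_vram_gb(32, 4096, 2048, 4):.1f} GB")
```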

When facing model training or SFT scenarios, there are two major memory consumers to consider.

3. Gradients

The first is gradients.

During back-propagation, the system needs to calculate the gradient value for each trainable parameter in order to update the model weights.

VRAM_gradients ≈ number of trainable parameters × bytes corresponding to the training precision

Usually the gradient precision matches the precision of the model parameters during training; for example, if FP16 is used for training, the gradients also occupy FP16 space.

4. Optimizer States

The second is the optimizer state, which is a big memory consumer during training. Optimizers (such as Adam, AdamW) need to maintain state information (such as momentum, variance) for each trainable parameter.

More importantly, these state values are usually stored in FP32 (4 bytes) precision, even if the main model is trained in FP16 or BF16. AdamW therefore requires an additional 2 × 4 = 8 bytes of storage for each trainable parameter.

For a fully fine-tuned 8B model, this alone may require

8B × 8 bytes = 64 GB

Using an 8-bit optimizer can significantly reduce this.
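Putting gradients and optimizer state together, a rough sketch of the extra memory that full training adds on top of the weights (assuming FP16 gradients and AdamW with two FP32 states per parameter) might look like this:

```python
def training_extras_gb(trainable_params: float,
                       grad_bytes: float = 2.0,    # FP16 gradients
                       optim_bytes: float = 8.0):  # AdamW: two FP32 states per parameter
    """Return (gradient GB, optimizer-state GB) for a given trainable parameter count."""
    grads_gb = trainable_params * grad_bytes / 1e9
    optim_gb = trainable_params * optim_bytes / 1e9
    return grads_gb, optim_gb

grads, optim = training_extras_gb(8e9)
print(f"gradients ≈ {grads:.0f} GB, optimizer state ≈ {optim:.0f} GB")  # ~16 GB and ~64 GB
```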

GPU memory estimation in inference/training scenarios

1. Inference

Total inference VRAM ≈ VRAM_parameters + VRAM_activations + VRAM_KV_Cache + VRAM_overhead

Take a Llama 3 8B (FP16) inference as an example:

Model parameters: 8B parameters × 2 bytes/parameter = 16 GB

Activations and KV cache: highly dependent on sequence length and batch size. For batch size 4 and sequence length 2048, assuming hidden dim = 4096 and 32 layers, the KV cache (FP16) is:

2 × 32 × 4096 × 2048 × 4 × 2 bytes ≈ 4.3 GB

Overhead: framework, CUDA kernels, estimated 1-2 GB 
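Combining the pieces above gives a back-of-the-envelope inference estimate. In this sketch the activation term is folded into a fixed overhead figure, which is an assumption on my part:

```python
def inference_vram_gb(num_params: float, num_layers: int, hidden_dim: int,
                      seq_len: int, batch_size: int,
                      bytes_per_value: float = 2.0, overhead_gb: float = 2.0) -> float:
    """Weights + KV cache + a lump-sum allowance for activations and framework overhead."""
    params_gb = num_params * bytes_per_value / 1e9
    kv_gb = 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_value / 1e9
    return params_gb + kv_gb + overhead_gb

# Llama 3 8B in FP16, batch size 4, sequence length 2048 -> roughly 22 GB
print(f"{inference_vram_gb(8e9, 32, 4096, 2048, 4):.1f} GB")
```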

2. Training

Full fine-tuning

VRAM ≈ VRAM_params + VRAM_gradients + VRAM_optimizer + VRAM_activations + VRAM_overhead

Llama 3 8B (FP16), AdamW (FP32 state)

1. Model parameters (FP16): 8B parameters × 2 bytes/parameter = 16 GB

2. Gradients (FP16): 8B parameters × 2 bytes/parameter = 16 GB

3. Optimizer state (AdamW, FP32): 2 states/parameter × 8B parameters × 4 bytes/state = 64 GB

4. Activations: very dependent on batch size and sequence length; roughly 10-30 GB or more (highly approximate).

5. Additional overhead: estimated 1-2 GB.

6. Estimated total: 16 + 16 + 64 + (10 to 30) + (1 to 2) ≈ 107-128 GB
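The same arithmetic as a sketch (the activation figure is a guess in the 10-30 GB range, here fixed at 20 GB):

```python
def full_finetune_vram_gb(num_params: float,
                          param_bytes: float = 2.0,   # FP16 weights
                          grad_bytes: float = 2.0,    # FP16 gradients
                          optim_bytes: float = 8.0,   # AdamW: two FP32 states per parameter
                          activations_gb: float = 20.0,
                          overhead_gb: float = 2.0) -> float:
    """Sum the five terms of the full fine-tuning formula above."""
    per_param_gb = (param_bytes + grad_bytes + optim_bytes) * num_params / 1e9
    return per_param_gb + activations_gb + overhead_gb

print(f"{full_finetune_vram_gb(8e9):.0f} GB")  # ~118 GB, inside the 107-128 GB range
```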

PEFT fine-tuning

Fine-tuning using techniques such as LoRA can significantly reduce VRAM requirements by freezing the base model parameters and training only small adapter layers.

Llama 3 8B with LoRA (Rank=8, Alpha=16)

1. Base model parameters (frozen, e.g. FP16): 16 GB
2. LoRA parameters (trainable, BF16): usually very small, roughly 10 to 50 million parameters. Assume 20 million parameters × 2 bytes/parameter ≈ 40 MB (negligible relative to the base model).
3. LoRA gradients (BF16): 20M parameters × 2 bytes/parameter ≈ 40 MB.
4. LoRA optimizer state (AdamW, FP32): 2 × 20M parameters × 4 bytes/state ≈ 160 MB.
5. Activations: still significant, similar to inference, but computed for the full model during the forward/backward pass through the adapters. Estimated 10-30 GB (depending on batch size/sequence length).
6. Overhead: 1-2 GB.
7. Total VRAM (LoRA): 16 GB (base) + ~0.24 GB (LoRA params/grads/optimizer) + (10 to 30) GB (activations) + (1 to 2) GB (overhead) ≈ 27-48 GB
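For comparison, a LoRA-style estimate with the same assumptions (20 million trainable adapter parameters, 20 GB of activations) lands comfortably inside the range above:

```python
def lora_vram_gb(base_params: float, lora_params: float,
                 base_bytes: float = 2.0,    # frozen FP16/BF16 base weights
                 adapter_bytes: float = 2.0, # BF16 adapter weights and gradients
                 optim_bytes: float = 8.0,   # AdamW FP32 states for the adapters only
                 activations_gb: float = 20.0, overhead_gb: float = 2.0) -> float:
    """Frozen base model + small trainable adapters + activations + overhead."""
    base_gb = base_params * base_bytes / 1e9
    adapters_gb = lora_params * (2 * adapter_bytes + optim_bytes) / 1e9  # weights + grads + optimizer
    return base_gb + adapters_gb + activations_gb + overhead_gb

print(f"{lora_vram_gb(8e9, 2e7):.1f} GB")  # ~38 GB with these assumptions
```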

An online GPU memory calculator

There is an online calculator for estimating GPU memory requirements; you can give it a try.