How much GPU memory is needed to run a 70B LLM?

Written by Jasper Cole
Updated on: July 17th, 2025
Recommendation

Explore the GPU memory requirements for running a 70B-parameter LLM and the unique advantages of GPUs in AI computing.

Core content:
1. Why compute-intensive AI workloads are a natural fit for GPUs
2. The factors that determine GPU memory usage
3. How model size and data precision affect memory


As the title of this article asks, have you ever wondered how much GPU memory is needed to load a 70B LLM? After reading this article, you should have the answer.

Why GPU, not CPU?


AI is, at its core, massive matrix and vector computation. It is computationally intensive and needs a large amount of memory to store the model's parameters. Unlike the CPU, which is optimized for scalar operations, the GPU is a coprocessor that pushes SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads) to the extreme, packing highly parallel computing units into its architecture.

Simply put, the GPU is better suited to compute-intensive tasks than the CPU, and LLM inference and deployment are exactly such tasks. That is why GPUs, rather than CPUs, are used to run LLMs.

An interesting aside: GPU stands for Graphics Processing Unit, and as the name suggests it was originally designed for image rendering. As it turns out, ML/AI workloads are also dominated by matrix operations, so the GPU naturally took over AI/ML and, eventually, LLM workloads.
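As a rough illustration of that parallelism, here is a minimal PyTorch sketch (assuming PyTorch is installed and a CUDA-capable GPU is available; the matrix size is arbitrary) that times the same dense matrix multiplication on the CPU and on the GPU:

import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    # Create two random n x n matrices on the target device
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # finish setup before timing
    start = time.time()
    _ = a @ b  # the dense matrix multiplication at the heart of LLM workloads
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels run asynchronously; wait for completion
    return time.time() - start

print(f"CPU: {time_matmul('cpu'):.4f} s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s")

On most hardware the GPU run finishes dramatically faster, which is exactly the gap that makes GPUs the default choice for LLM inference.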


How to calculate GPU Memory


When running a large model, several factors determine how much GPU memory is needed, mainly the following:
  •   Model size

  •   Key-Value Cache

  •   Memory Overhead


Model size

The size of the model itself largely determines the size of the GPU memory required. The larger the model, the more GPU memory is required.

The size of the model is determined by two things: the number of parameters and the parameter data type (precision).

Number of parameters

That is, the number of parameters the model was trained with, usually expressed in billions (B). For example, GPT-3 has 175 billion parameters, and LLaMa-2 13B has 13 billion.

Parameter Data Type

That is, the data type used to store the model's parameters, such as float32, float16, or an 8-bit format. For example, in PyTorch you can set the default data type as follows:

import torch

# Set the default data type to float16
torch.set_default_dtype(torch.float16)

# Create a Transformer model instance (TransformerModel is a placeholder for your own model class)
model = TransformerModel()

Different data types have different sizes for each parameter:

  • float32: one parameter occupies 4 bytes

  • float16: one parameter occupies 2 bytes

  • 8-bit (e.g., int8): one parameter occupies 1 byte

Assuming we load the LLaMa-2 13B model in float16, the memory needed just to hold the model is: 13 Billion * 2 bytes = 26 GB.
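As a quick sanity check, the same arithmetic can be written in a few lines of Python (a minimal sketch; the per-parameter byte sizes come straight from the list above, and PyTorch's element_size() can confirm them):

import torch

# Confirm the bytes occupied by one parameter of each dtype
print(torch.tensor([], dtype=torch.float32).element_size())  # 4
print(torch.tensor([], dtype=torch.float16).element_size())  # 2

def model_size_gb(n_params_billion: float, bytes_per_param: int) -> float:
    # parameters * bytes per parameter, converted to (decimal) gigabytes
    return n_params_billion * 1e9 * bytes_per_param / 1e9

print(model_size_gb(13, 2))   # LLaMa-2 13B in float16 -> 26.0 GB
print(model_size_gb(175, 2))  # GPT-3 in float16 -> 350.0 GB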

Key-Value Cache

KV Cache (Key-Value Cache) is an optimization technique used by Transformer models during autoregressive decoding, mainly to speed up inference. The Key and Value vectors computed for earlier tokens are cached so they do not have to be recomputed at every step, which improves inference efficiency. The general idea is similar to the space-for-time trade-off in dynamic programming (DP).

The per-token KV cache size is calculated as follows:

2 * n_dtype * n_layers * n_hidden_size

Parameter explanation:

  • 2 accounts for the two vectors cached per token: the Key and the Value.

  • n_dtype is the number of bytes per value, determined by the parameter data type described above (for example, 2 bytes for float16).

  • n_layers is the number of Transformer layers (encoder and/or decoder layers). Each layer contains a self-attention mechanism and a feed-forward network, and each layer keeps its own Key and Value cache; stacking more layers increases the model's depth and expressiveness.

  • n_hidden_size is the dimension of the model's hidden state (its hidden size), i.e., the length of each cached Key and Value vector.

Taking the LLaMa-2 13B model as an example, with float16 the KV cache size for 1 token is:

2 * 2 * 40 * 5120 = 819,200 bytes ≈ 800 KB per token

For LLaMa-2 13B, n_layers is 40 and n_hidden_size is 5120, so each token requires roughly 800 KB of KV cache.
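The per-token figure can be reproduced with a small helper (a sketch using the LLaMa-2 13B values quoted above):

def kv_cache_per_token_bytes(n_layers: int, n_hidden_size: int, n_dtype: int) -> int:
    # 2 cached vectors (Key and Value) * bytes per value * layers * hidden size
    return 2 * n_dtype * n_layers * n_hidden_size

per_token = kv_cache_per_token_bytes(n_layers=40, n_hidden_size=5120, n_dtype=2)
print(per_token)         # 819200 bytes
print(per_token / 1024)  # 800.0 KB per token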

800 KB may not seem like much, but that is the usage of just one token. In real applications, the user's input plus the model's output often involve thousands or even tens of thousands of tokens.
For example, a token-counting tool shows that the question "What's the weather like in Shanghai today?" takes 5 tokens, and the model's answer describing Shanghai's weather takes 54 tokens.

Therefore, for this query the LLM uses 5 + 54 = 59 tokens in total. At about 800 KB per token, this single query needs roughly 59 * 800 KB ≈ 46 MB of KV cache.

Note: In real scenarios, an LLM may receive far more tokens as input, for example when processing long texts. The maximum number of tokens for a single LLaMa-2 13B request is 4096, so the maximum KV cache for one request is 4096 * 800 KB ≈ 3.2 GB. In addition, the KV cache grows linearly with the number of concurrent requests.

Memory Overhead

During LLM inference there are also scattered temporary variables and buffers, and these occupy GPU memory as well. Therefore, beyond the model size and the KV cache, a certain amount of additional overhead must be budgeted. A common rule of thumb is to reserve about 10% of (model size + maximum KV cache) as this additional overhead.


GPU Memory Total Calculation Formula


To calculate the total GPU memory a large model needs, all of the factors above must be taken into account. The complete formula is:
Total GPU Memory = Model Size + KV Cache + Memory Overhead
Finally, let's take LLaMa-2 13B as an example. Assume 10 concurrent requests, each using the maximum number of tokens (4096). The required GPU memory is calculated as follows:
  1. Model size = 13 Billion * 2 Bytes = 26 GB

  2. Total KV cache = 800 KB * 4096 tokens * 10 concurrent requests ≈ 32 GB

  3. Memory Overhead = 0.1 * (26 GB + 32 GB) = 5.8 GB

So the total GPU memory required is: 26 GB + 32 GB + 5.8 GB = 63.8 GB. This would require two 40 GB NVIDIA A100 GPUs (or a single 80 GB A100).
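Putting the three pieces together, here is a rough end-to-end estimator (a sketch of the rule of thumb above, not a measurement of a real deployment; the small gap versus 63.8 GB comes from the worked example rounding the per-token KV cache down to 800 KB):

def total_gpu_memory_gb(n_params_billion: float, n_dtype: int, n_layers: int,
                        n_hidden_size: int, max_tokens: int, concurrency: int) -> float:
    model_size = n_params_billion * 1e9 * n_dtype            # bytes to hold the weights
    kv_per_token = 2 * n_dtype * n_layers * n_hidden_size    # bytes of KV cache per token
    kv_cache = kv_per_token * max_tokens * concurrency       # KV cache across all requests
    overhead = 0.1 * (model_size + kv_cache)                 # ~10% rule-of-thumb overhead
    return (model_size + kv_cache + overhead) / 1e9          # decimal gigabytes

# LLaMa-2 13B in float16, 4096 tokens per request, 10 concurrent requests
print(round(total_gpu_memory_gb(13, 2, 40, 5120, 4096, 10), 1))  # ~65.5 GB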

GPU Memory for Common Large Models

The following two tables show the GPU memory required for models of different sizes, at different token counts and numbers of concurrent requests.

Single concurrent request:

10 concurrent requests:

As the number of concurrent requests, the number of tokens, and the model size increase, the required GPU memory grows dramatically, and so does the cost of the hardware.