How much GPU memory is needed to run a 70B LLM?

This article explores the GPU memory requirements for running a 70B-parameter LLM and shows why GPUs have a unique advantage in AI computing.
Core content:
1. Why compute-intensive AI workloads are a natural fit for GPUs
2. The factors that determine GPU memory usage
3. How model size and data precision affect memory
As the title of this article asks, have you ever wondered how much GPU memory is needed to load a 70B LLM? You should have the answer by the end of this article.
At its core, AI is massive matrix and vector computation: it is compute-intensive and needs a large amount of memory to store the model's trained parameters.
A GPU is a coprocessor that pushes SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads) to the extreme, packing its architecture with highly parallel compute units.
Simply put, a GPU is better suited to compute-intensive tasks than a CPU, and LLM inference and deployment are exactly such tasks. That is why GPUs, rather than CPUs, are the natural choice for running LLMs.
There is an interesting fact: the full name of GPU is Graphics Processing Unit, so as the name suggests it was originally designed for image rendering. But fate seems to have played a joke: ML/AI also turned out to be built on matrix operations, so GPUs naturally took over AI/ML and eventually LLM workloads.
During inference, GPU memory usage is made up of three parts:
Model size
Key-Value Cache
Memory Overhead
Model size
The size of the model itself largely determines how much GPU memory is required: the larger the model, the more memory it needs. The model size is determined by two things: the model parameter count and the parameter data type.
Model parameter count
That is, the number of parameters the model has, usually expressed in B (billion). For example, GPT-3 has 175 billion parameters, and LLaMa-2 13B has 13 billion.
Parameter data type
That is, the data type in which the model's parameters are stored, such as float32, float16, or an 8-bit format. In PyTorch, for example, you can set the data type like this:
import torch
import torch.nn as nn

# Set the default floating-point data type to float16
torch.set_default_dtype(torch.float16)

# Modules created from now on allocate their parameters in float16
model = nn.Transformer(d_model=512, nhead=8)
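With the model created above, you can verify how much memory the weights actually occupy by summing the storage of every parameter tensor; a minimal sketch using standard PyTorch APIs:

# Each parameter tensor contributes numel() elements of element_size() bytes
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Weights occupy about {param_bytes / 1024**2:.1f} MiB")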
Different data types take a different number of bytes per parameter:
float32: 4 bytes per parameter
float16: 2 bytes per parameter
8-bit: 1 byte per parameter
Assuming we use float16 to load the LLaMa-2 13B model, the memory needed just to hold the weights is: 13 billion * 2 bytes = 26 GB.
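The same arithmetic as a short sketch (the helper name and byte table are just for illustration):

# Bytes per parameter for the common data types listed above
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "8bit": 1}

def model_weight_gb(n_params: float, dtype: str) -> float:
    # Memory needed just to hold the weights, in (decimal) GB
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(model_weight_gb(13e9, "float16"))   # LLaMa-2 13B in float16 -> 26.0 GB
print(model_weight_gb(175e9, "float16"))  # GPT-3 175B in float16  -> 350.0 GB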
Key-Value Cache
KV Cache (Key-Value Cache) is an optimization technique used by Transformer models during autoregressive decoding, mainly to speed up inference for large models. The Key and Value vectors computed for earlier tokens are cached so they do not have to be recomputed, which improves inference efficiency. The general idea is similar to the space-for-time trade-off in dynamic programming (DP).
The per-token KV Cache size is calculated as follows:
2 * n_dtype * n_layers * n_hidden_size
Parameter explanation:
The factor 2 accounts for the Key and the Value: both vectors must be cached for every token.
n_dtype is the size in bytes of the parameter data type discussed above, e.g. 2 bytes for float16.
n_layers is the total number of Transformer layers. Each layer contains a self-attention mechanism and a feedforward network; stacking more layers increases the depth and expressiveness of the model.
n_hidden_size is the dimension of the hidden states, i.e. the size of the vectors passed between layers.
Taking the LLaMa-2 13B model with float16 as an example, the KV Cache for a single token is:
2 * 2 * 40 * 5120 = 819,200 bytes ≈ 800 KB/token
For LLaMa-2 13B, n_layers is 40 and n_hidden_size is 5120, so each token needs roughly 800 KB of cache.
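The same per-token arithmetic as a sketch (the function name is just for illustration):

def kv_cache_bytes_per_token(n_dtype: int, n_layers: int, n_hidden_size: int) -> int:
    # 2 -> one Key vector and one Value vector are cached per layer for each token
    return 2 * n_dtype * n_layers * n_hidden_size

per_token = kv_cache_bytes_per_token(n_dtype=2, n_layers=40, n_hidden_size=5120)
print(per_token)          # 819200 bytes
print(per_token / 1024)   # 800.0 KB per token for LLaMa-2 13B in float16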
Take the query "What's the weather like in Shanghai today?" as an example. The prompt is tokenized into 5 tokens, and the model generates 54 tokens to answer with the Shanghai weather. The LLM therefore processes 5 + 54 = 59 tokens for this query; at about 800 KB per token, the KV Cache for this single query is roughly 59 * 800 KB ≈ 46 MB.
Note: in real scenarios an LLM may receive far more input tokens, for example when processing long texts. A single LLaMa-2 13B request can contain at most 4096 tokens, so the maximum KV Cache for one request is about 4096 * 800 KB ≈ 3.2 GB. And this number grows in proportion to the number of concurrent requests.
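A quick check of that upper bound, using the per-token figure from the previous example:

per_token_bytes = 819_200                     # ~800 KB per token (LLaMa-2 13B, float16)
max_kv_per_request = 4096 * per_token_bytes   # one request at the maximum context length
print(max_kv_per_request / 1024**3)           # ~3.1 GiB, i.e. the ~3.2 GB quoted above
print(10 * max_kv_per_request / 1024**3)      # ~31 GiB for 10 concurrent requests (rounded to 32 GB below)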
Memory Overhead
During LLM inference there are also various small temporary variables that occupy GPU memory. So in addition to the model weights and the KV Cache, a certain amount of extra memory overhead is needed. A common estimate is to reserve about 10% of (model size + maximum KV Cache) for this overhead.
Putting it all together for LLaMa-2 13B (float16, 4096 tokens per request, 10 concurrent requests):
Total GPU Memory = Model Size + KV Cache + Memory Overhead
Model size = 13 Billion * 2 Bytes = 26 GB
Total KV cache = 800 KB * 4096 Tokens * 10 concurrent requests = 32 GB
Memory Overhead = 0.1 * (26 GB + 32 GB) = 5.8 GB
So the total GPU memory required is: 26 GB + 32 GB + 5.8 GB = 63.8 GB, which calls for two NVIDIA A100 (40 GB) GPUs.
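The whole estimate as a sketch (function and parameter names are just for illustration; the result matches the figures above up to rounding):

def total_gpu_memory_gb(n_params, dtype_bytes, n_layers, n_hidden,
                        max_tokens=4096, concurrency=1, overhead_ratio=0.10):
    # Weights + KV Cache for all concurrent requests, plus ~10% overhead, in decimal GB
    model_bytes = n_params * dtype_bytes
    kv_bytes = 2 * dtype_bytes * n_layers * n_hidden * max_tokens * concurrency
    return (1 + overhead_ratio) * (model_bytes + kv_bytes) / 1e9

# LLaMa-2 13B, float16, 10 concurrent requests of 4096 tokens
print(total_gpu_memory_gb(13e9, 2, 40, 5120, concurrency=10))  # ~65.5 GB (63.8 GB with the rounded figures above)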
The following two tables show the GPU memory required by models of different sizes, for different token counts and different numbers of concurrent requests.
Single concurrent request:
10 concurrent requests:
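The same kind of table can be reproduced with the formula above. In the sketch below, the parameter counts, layer counts, and hidden sizes are the published LLaMa-2 configurations (assumptions as far as this walkthrough goes), and the simple formula ignores the grouped-query attention used by LLaMa-2 70B, so its KV Cache is an overestimate:

# Assumed configs: (parameter count, n_layers, n_hidden_size)
CONFIGS = {
    "LLaMa-2 7B":  (7e9,  32, 4096),
    "LLaMa-2 13B": (13e9, 40, 5120),
    "LLaMa-2 70B": (70e9, 80, 8192),  # grouped-query attention ignored
}
DTYPE_BYTES = 2   # float16
MAX_TOKENS = 4096

for name, (n_params, n_layers, n_hidden) in CONFIGS.items():
    for concurrency in (1, 10):
        model_gb = n_params * DTYPE_BYTES / 1e9
        kv_gb = 2 * DTYPE_BYTES * n_layers * n_hidden * MAX_TOKENS * concurrency / 1e9
        total_gb = 1.1 * (model_gb + kv_gb)
        print(f"{name:12s} | {concurrency:2d} request(s) | ~{total_gb:.0f} GB")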