How much GPU memory is needed to run a 70B LLM?

This article explores the GPU memory requirements for running a 70B-parameter LLM and shows why GPUs have a unique advantage in AI computing.
Core content:
1. Why compute-intensive AI workloads are a natural fit for GPUs
2. The factors that determine GPU memory usage
3. How model size and data precision affect memory
As the title of this article asks, have you ever wondered how much GPU memory is needed to load a 70B LLM? You should have the answer by the end of this article.
At its core, AI is massive matrix and vector computation: it is compute-intensive and needs a large amount of memory to store the model's trained parameters.
A GPU is a coprocessor that pushes SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads) to the extreme, packing its architecture with highly parallel compute units.
Simply put, a GPU is better suited to compute-intensive tasks than a CPU, and LLM inference and deployment are exactly such tasks. That is why GPUs, rather than CPUs, are the natural choice for running LLMs.
There is an interesting fact: the full name of GPU is Graphics Processing Unit, so as the name suggests it was originally designed for image rendering. But fate seems to have played a joke: ML/AI also turned out to be built on matrix operations, so GPUs naturally took over AI/ML and eventually LLM workloads.
During inference, GPU memory usage is made up of three parts:
Model size
Key-Value Cache
Memory Overhead
Model size
The size of the model itself largely determines how much GPU memory is required: the larger the model, the more memory it needs. The model size is determined by two things: the model parameter count and the parameter data type.
Model parameter count
That is, the number of parameters the model has, usually expressed in B (billion). For example, GPT-3 has 175 billion parameters, and LLaMa-2 13B has 13 billion.
Parameter data type
That is, the data type in which the model's parameters are stored, such as float32, float16, or an 8-bit format. In PyTorch, for example, you can set the data type like this:
import torch
import torch.nn as nn

# Set the default floating-point data type to float16
torch.set_default_dtype(torch.float16)

# Modules created from now on allocate their parameters in float16
model = nn.Transformer(d_model=512, nhead=8)
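With the model created above, you can verify how much memory the weights actually occupy by summing the storage of every parameter tensor; a minimal sketch using standard PyTorch APIs:

# Each parameter tensor contributes numel() elements of element_size() bytes
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Weights occupy about {param_bytes / 1024**2:.1f} MiB")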
Different data types take a different number of bytes per parameter:
float32: 4 bytes per parameter
float16: 2 bytes per parameter
8-bit: 1 byte per parameter
Assuming we use float16 to load the LLaMa-2 13B model, the memory needed just to hold the weights is: 13 billion * 2 bytes = 26 GB.
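The same arithmetic as a short sketch (the helper name and byte table are just for illustration):

# Bytes per parameter for the common data types listed above
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "8bit": 1}

def model_weight_gb(n_params: float, dtype: str) -> float:
    # Memory needed just to hold the weights, in (decimal) GB
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(model_weight_gb(13e9, "float16"))   # LLaMa-2 13B in float16 -> 26.0 GB
print(model_weight_gb(175e9, "float16"))  # GPT-3 175B in float16  -> 350.0 GB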
Key-Value Cache
KV Cache (Key-Value Cache) is an optimization technique used by Transformer models during autoregressive decoding, mainly to speed up inference for large models. The Key and Value vectors computed for earlier tokens are cached so they do not have to be recomputed, which improves inference efficiency. The general idea is similar to the space-for-time trade-off in dynamic programming (DP).
The per-token KV Cache size is calculated as follows:
2 * n_dtype * n_layers * n_hidden_size
Parameter explanation:
The factor 2 accounts for the Key and the Value: both vectors must be cached for every token.
n_dtype is the size in bytes of the parameter data type discussed above, e.g. 2 bytes for float16.
n_layers is the total number of Transformer layers. Each layer contains a self-attention mechanism and a feedforward network; stacking more layers increases the depth and expressiveness of the model.
n_hidden_size is the dimension of the hidden states, i.e. the size of the vectors passed between layers.
Taking the LLaMa-2 13B model with float16 as an example, the KV Cache for a single token is:
2 * 2 * 40 * 5120 = 819,200 bytes ≈ 800 KB/token
For LLaMa-2 13B, n_layers is 40 and n_hidden_size is 5120, so each token needs roughly 800 KB of cache.
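The same per-token arithmetic as a sketch (the function name is just for illustration):

def kv_cache_bytes_per_token(n_dtype: int, n_layers: int, n_hidden_size: int) -> int:
    # 2 -> one Key vector and one Value vector are cached per layer for each token
    return 2 * n_dtype * n_layers * n_hidden_size

per_token = kv_cache_bytes_per_token(n_dtype=2, n_layers=40, n_hidden_size=5120)
print(per_token)          # 819200 bytes
print(per_token / 1024)   # 800.0 KB per token for LLaMa-2 13B in float16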
Take the query "What's the weather like in Shanghai today?" as an example. The prompt is tokenized into 5 tokens, and the model generates 54 tokens to answer with the Shanghai weather. The LLM therefore processes 5 + 54 = 59 tokens for this query; at about 800 KB per token, the KV Cache for this single query is roughly 59 * 800 KB ≈ 46 MB.
Note: in real scenarios an LLM may receive far more input tokens, for example when processing long texts. A single LLaMa-2 13B request can contain at most 4096 tokens, so the maximum KV Cache for one request is about 4096 * 800 KB ≈ 3.2 GB. And this number grows in proportion to the number of concurrent requests.
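A quick check of that upper bound, using the per-token figure from the previous example:

per_token_bytes = 819_200                     # ~800 KB per token (LLaMa-2 13B, float16)
max_kv_per_request = 4096 * per_token_bytes   # one request at the maximum context length
print(max_kv_per_request / 1024**3)           # ~3.1 GiB, i.e. the ~3.2 GB quoted above
print(10 * max_kv_per_request / 1024**3)      # ~31 GiB for 10 concurrent requests (rounded to 32 GB below)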
Memory Overhead
During LLM inference there are also various small temporary variables that occupy GPU memory. So in addition to the model weights and the KV Cache, a certain amount of extra memory overhead is needed. A common estimate is to reserve about 10% of (model size + maximum KV Cache) for this overhead.
Putting it all together for LLaMa-2 13B (float16, 4096 tokens per request, 10 concurrent requests):
Total GPU Memory = Model Size + KV Cache + Memory Overhead
Model size = 13 Billion * 2 Bytes = 26 GB
Total KV cache = 800 KB * 4096 Tokens * 10 concurrent requests = 32 GB
Memory Overhead = 0.1 * (26 GB + 32 GB) = 5.8 GB
So the total GPU memory required is: 26 GB + 32 GB + 5.8 GB = 63.8 GB, which calls for two NVIDIA A100 (40 GB) GPUs.
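The whole estimate as a sketch (function and parameter names are just for illustration; the result matches the figures above up to rounding):

def total_gpu_memory_gb(n_params, dtype_bytes, n_layers, n_hidden,
                        max_tokens=4096, concurrency=1, overhead_ratio=0.10):
    # Weights + KV Cache for all concurrent requests, plus ~10% overhead, in decimal GB
    model_bytes = n_params * dtype_bytes
    kv_bytes = 2 * dtype_bytes * n_layers * n_hidden * max_tokens * concurrency
    return (1 + overhead_ratio) * (model_bytes + kv_bytes) / 1e9

# LLaMa-2 13B, float16, 10 concurrent requests of 4096 tokens
print(total_gpu_memory_gb(13e9, 2, 40, 5120, concurrency=10))  # ~65.5 GB (63.8 GB with the rounded figures above)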
The following two tables show the GPU memory required by models of different sizes, for different token counts and different numbers of concurrent requests.
Single concurrent request:
10 concurrent requests:
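The same kind of table can be reproduced with the formula above. In the sketch below, the parameter counts, layer counts, and hidden sizes are the published LLaMa-2 configurations (assumptions as far as this walkthrough goes), and the simple formula ignores the grouped-query attention used by LLaMa-2 70B, so its KV Cache is an overestimate:

# Assumed configs: (parameter count, n_layers, n_hidden_size)
CONFIGS = {
    "LLaMa-2 7B":  (7e9,  32, 4096),
    "LLaMa-2 13B": (13e9, 40, 5120),
    "LLaMa-2 70B": (70e9, 80, 8192),  # grouped-query attention ignored
}
DTYPE_BYTES = 2   # float16
MAX_TOKENS = 4096

for name, (n_params, n_layers, n_hidden) in CONFIGS.items():
    for concurrency in (1, 10):
        model_gb = n_params * DTYPE_BYTES / 1e9
        kv_gb = 2 * DTYPE_BYTES * n_layers * n_hidden * MAX_TOKENS * concurrency / 1e9
        total_gb = 1.1 * (model_gb + kv_gb)
        print(f"{name:12s} | {concurrency:2d} request(s) | ~{total_gb:.0f} GB")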