How to choose H100/A100? 90% of people ignore the key to selection: GPU memory bandwidth!

Written by Jasper Cole
Updated on: July 9, 2025
Recommendation

A must-read for deep learning deployment! How does GPU memory bandwidth affect performance?

Core content:
1. The importance of GPU memory bandwidth to the performance of large language models
2. The basic composition of GPUs and the role of memory interfaces
3. Different GPU memory standards (GDDR5/GDDR6/HBM) and their impact


Many people ask on Zhihu or in other forums, "If I deploy DeepSeek locally, how should I choose a GPU?" or "If I deploy DeepSeek R1, how many H100s or A100s do I need at a minimum?" Everyone seems to prioritize the GPU model while overlooking one metric: the GPU's memory bandwidth. This metric actually plays a key role in the performance of your large language model, and in some cases it should even drive your choice of GPU. Choosing the right GPU not only delivers better results with less effort, it can also save money.

This article explains what GPU memory bandwidth is, why it matters, and how it affects deep learning workloads. Understanding memory bandwidth can help machine learning teams make informed decisions when choosing GPU servers.

Tips: DigitalOcean's GPU Droplet servers are an excellent choice for scalable, high-performance computing. They offer more than a dozen GPU options, including the A100, H100, and H200, available as pay-as-you-go cloud instances or bare metal machines. If you are interested, you can scan the QR code at the end of the article to contact DigitalOcean's exclusive strategic partner in China, Zhuopu Cloud (aidroplet.com).


The basic components of a GPU

A graphics card is similar to a motherboard in that it is a printed circuit board that houses the processor, memory, and power management unit. It also contains a BIOS chip that stores the graphics card's settings and performs diagnostics on the memory and input-output components during startup.

The GPU on a graphics card is similar to the CPU on a computer motherboard. However, the GPU is specifically designed to handle the complex mathematical and geometric calculations required for graphics rendering, as well as other highly parallel workloads such as machine learning.


In a graphics card, the computing unit (GPU) is connected to the memory unit (VRAM, or Video Random Access Memory) via a bus called the memory interface.

In a computer system there are multiple memory interfaces. For a GPU, the memory interface is the physical bit width of the memory bus connecting the GPU to its VRAM. Data is sent to and from graphics memory on every clock cycle, billions of times per second, and the number of bits that can travel across the bus per cycle is the interface width, typically quoted as "256-bit", "384-bit", and so on. A 384-bit memory interface can transfer 384 bits per clock cycle, which makes interface width a key term in the memory bandwidth calculation that determines a GPU's maximum memory throughput. This is also why NVIDIA and AMD use standardized point-to-point buses in their graphics cards: the POD125 standard used by NVIDIA's Ampere workstation cards (A4000, A5000, and A6000), for example, describes the signaling protocol for communicating with GDDR6 VRAM.

Another memory bandwidth factor to consider is latency. Early systems used general-purpose buses such as VMEbus and S-100, but modern memory buses connect directly to the VRAM chips to reduce latency.

GDDR5 and GDDR6 are among the most common GPU memory standards. Each memory chip provides a 32-bit interface (in GDDR6, organized as two independent 16-bit channels), which allows multiple memory accesses to proceed simultaneously. A GPU with a 256-bit memory interface therefore uses eight GDDR6 memory chips.

Another class of memory is HBM (High Bandwidth Memory) and its successor HBM2. Each HBM stack has a 1024-bit wide interface and generally delivers considerably higher bandwidth than GDDR5 and GDDR6.

The external PCI-Express connection between the motherboard and the graphics card should not be confused with this internal memory interface. This bus is also characterized by its bandwidth and speed, although it is orders of magnitude slower.

What is GPU memory bandwidth?

The memory bandwidth of a GPU determines how quickly it can move data between memory (VRAM) and the compute cores. It is a more telling figure than the memory clock speed alone, because it combines the data transfer rate between memory and the compute cores with the number of parallel links in the bus connecting the two.

Since the early 1980s, when a home computer's memory bandwidth was roughly 1 MB/s, bandwidth in consumer devices has grown by many orders of magnitude. Available compute, however, has grown even faster. To avoid constantly running into bandwidth limits, it is critical to match workloads and resources in terms of both memory size and memory bandwidth.

Let's take the NVIDIA RTX A4000, a widely used workstation GPU for ML, as an example. It comes with 16 GB of GDDR6 memory, a 256-bit memory interface (the number of independent links on the bus between the GPU and VRAM), and 6144 CUDA cores. Together, these memory-related specifications give the A4000 a memory bandwidth of 448 GB/s.

Here are some other popular GPU specs:

GPU        VRAM            Memory interface width    Memory bandwidth
P4000      8 GB GDDR5      256-bit                   243 GB/s
P5000      16 GB GDDR5X    256-bit                   288 GB/s
P6000      24 GB GDDR5X    384-bit                   432 GB/s
V100       32 GB HBM2      4096-bit                  900 GB/s
RTX 4000   8 GB GDDR6      256-bit                   416 GB/s
RTX 5000   16 GB GDDR6     256-bit                   448 GB/s
A4000      16 GB GDDR6     256-bit                   448 GB/s
A5000      24 GB GDDR6     384-bit                   768 GB/s
A6000      48 GB GDDR6     384-bit                   768 GB/s
A100       40 GB HBM2      5120-bit                  1555 GB/s

Why do machine learning applications require high memory bandwidth?

The impact of memory bandwidth may not be immediately obvious. When bandwidth is insufficient it becomes a bottleneck: thousands of GPU compute cores sit idle while waiting for data to arrive from memory. In addition, depending on the application, the GPU may process the same block of data multiple times (say T times). In that case, the external PCIe bandwidth only needs to be about 1/T of the GPU's internal memory bandwidth to avoid stalling the pipeline. The most common GPU workloads illustrate this reuse. When training a model, the program loads the training data into GDDR memory once and then passes it through the neural network layers in the compute cores again and again, typically for hours. Because of this reuse, the GPU's internal bandwidth can usefully be 20 times the PCIe bandwidth, or more.
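To make the reuse argument concrete, here is a minimal Python sketch. The bandwidth figure is taken from the table above, and the reuse factor is a hypothetical value, not a measurement:

```python
# If each block of data loaded over PCIe is reused T times by the compute
# cores, the PCIe link only needs about 1/T of the GPU's internal memory
# bandwidth to keep the pipeline fed.

internal_bw_gbs = 1555.0   # A100 HBM2 bandwidth from the table above (GB/s)
reuse_factor_t = 50        # hypothetical: each batch is reprocessed 50 times

required_pcie_gbs = internal_bw_gbs / reuse_factor_t
print(f"PCIe bandwidth needed to keep up: ~{required_pcie_gbs:.0f} GB/s")
# ~31 GB/s, roughly what a modern x16 PCIe link delivers per direction
```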

The memory bandwidth you need depends entirely on the type of project. A deep learning project that continuously feeds large volumes of data through the GPU, reprocesses it, and writes it back to memory calls for wide bandwidth. Video- and image-based machine learning projects generally demand more memory and bandwidth than natural language or audio projects. For most typical projects, 300 GB/s to 500 GB/s is a good reference range; it is not a universal rule, but it is usually sufficient for a wide range of visual machine learning applications.

Let's walk through an example of estimating the memory bandwidth a deep learning workload requires:

Consider a 50-layer ResNet with over 25 million weight parameters, stored as 32-bit floating point numbers. As a rough estimate, a single pass over one sample touches about 0.8 GB of memory (the 25 million weights alone occupy only about 0.1 GB; the remainder is working data such as activations). With a mini-batch of 32 processed in parallel, each model pass therefore moves roughly 25.6 GB of data.

For a GPU like the A100, with about 19.5 TFLOPS of compute, and with the ResNet model consuming roughly 497 GFLOPs per pass (final feature map of 7 × 7 × 2048), the GPU can complete approximately 39 full passes per second. That corresponds to a bandwidth requirement of roughly 998 GB/s. Since the A100 provides 1555 GB/s of memory bandwidth, it can run this model without the memory system becoming a bottleneck.
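Written out as code, the same back-of-envelope estimate looks like this. All inputs are the ballpark figures from the example above, not benchmarked numbers:

```python
# Rough bandwidth-requirement estimate for the ResNet-50 example above.

bytes_per_pass_gb = 0.8 * 32      # ~0.8 GB per sample x mini-batch of 32 = 25.6 GB
flops_per_pass_gflops = 497.0     # ~497 GFLOPs per model pass
gpu_peak_tflops = 19.5            # A100 FP32 peak compute

passes_per_second = gpu_peak_tflops * 1000 / flops_per_pass_gflops  # ~39 passes/s
required_bw_gbs = passes_per_second * bytes_per_pass_gb             # ~1000 GB/s

print(f"~{passes_per_second:.0f} passes/s -> ~{required_bw_gbs:.0f} GB/s required")
print("A100 memory bandwidth: 1555 GB/s, so headroom remains")
```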

How can I optimize my model to reduce memory bandwidth usage?

In general, machine learning algorithms, especially deep neural networks in the field of computer vision, have large memory and memory bandwidth requirements. There are some techniques that can be used to reduce the cost and time of deploying ML models in resource-constrained environments, even in powerful cloud ML services. Here are some strategies that can be implemented:

Partial fitting: When a dataset is too large to process in a single pass, the model can be fitted in stages instead of on all the data at once. You take one portion of the data, fit the model to update its weight vector, and then move on to the next portion, repeating until every portion has contributed. Needless to say, this reduces VRAM usage at the cost of longer training time. The main drawback is that not all algorithms and implementations support partial fitting or can easily be adapted to it. Nonetheless, it should be considered whenever possible.
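As an illustration, scikit-learn exposes this idea through the `partial_fit` method available on some estimators. The sketch below uses synthetic data and an arbitrary chunk size purely to show the pattern:

```python
# Incremental (partial) fitting with scikit-learn: train on one chunk of
# data at a time instead of loading the full dataset into memory.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])           # must be declared on the first call

for _ in range(10):                  # pretend each chunk is streamed from disk
    X_chunk = rng.normal(size=(1_000, 20))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print("trained incrementally on 10 chunks")
```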

Dimensionality reduction: This matters not only for reducing training time but also for reducing memory consumption at runtime. Techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), or matrix factorization can significantly reduce dimensionality, producing a representation with far fewer features while retaining the most important properties of the original data.
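A short PCA sketch (synthetic data, an arbitrary choice of 32 components) showing how the feature dimension, and with it the per-batch memory footprint, shrinks:

```python
# Dimensionality reduction with PCA: project 512-dimensional features
# down to 32 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 512)).astype(np.float32)

pca = PCA(n_components=32)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)      # (10000, 512) -> (10000, 32)
print(f"memory: {X.nbytes / 1e6:.1f} MB -> {X_reduced.nbytes / 1e6:.1f} MB")
```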

Sparse matrices: When dealing with sparse matrices, storing only the non-zero entries can save a great deal of memory. Depending on the number and distribution of non-zero entries, different data structures can be used, each saving substantial memory compared to a dense layout. The trade-off is that accessing individual elements becomes more involved and requires auxiliary index structures to unambiguously reconstruct the original matrix, exchanging some extra compute for lower memory and bandwidth usage.
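A small SciPy sketch (the matrix shape and roughly 1% density are arbitrary) showing the storage saving of CSR format over a dense array:

```python
# Sparse storage with SciPy: keep only the non-zero entries of a
# mostly-zero matrix in CSR (compressed sparse row) format.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((10_000, 1_000)).astype(np.float32)
dense[dense < 0.99] = 0.0            # keep roughly 1% of entries non-zero

csr = sparse.csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
csr_mb = (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, CSR: {csr_mb:.1f} MB")
```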

Some common questions

1. What is GPU memory bandwidth?

GPU memory bandwidth refers to the rate at which data can be transferred between the GPU and memory (VRAM). It is measured in gigabytes per second (GB/s) and plays a key role in processing large datasets, real-time rendering, and AI/ML workloads. Higher bandwidth allows for faster data transfer, thereby improving overall performance.

2. How to calculate GPU memory bandwidth?

GPU memory bandwidth is calculated using the following formula:

  • Memory bandwidth (GB/s) = memory bus width (bits) × effective data rate per pin (Gbps) ÷ 8
  • Memory bus width (in bits): the width of the memory interface, for example 128, 256, or 512 bits.
  • Effective data rate (in Gbps): the memory clock multiplied by the number of transfers per clock cycle (for example, GDDR6X runs at a higher per-pin data rate than GDDR6).
  • The division by 8 converts bits per second into bytes per second.

For example, a GPU with a 256-bit bus and 16 Gbps GDDR6 memory has a memory bandwidth of 256 × 16 ÷ 8 = 512 GB/s.
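As a quick sanity check, the formula can be expressed as a small helper function. The per-pin data rate in Gbps already folds in the clock speed and the transfers per cycle, so a single multiplication is enough; the figures below are commonly quoted specs and are meant only as illustrations:

```python
# Memory bandwidth (GB/s) = bus width (bits) x per-pin data rate (Gbps) / 8
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits * data_rate_gbps / 8

print(memory_bandwidth_gbs(256, 16))   # 512.0 GB/s  (the example above)
print(memory_bandwidth_gbs(256, 14))   # 448.0 GB/s  (matches the A4000 earlier)
print(memory_bandwidth_gbs(384, 21))   # 1008.0 GB/s (roughly an RTX 4090)
```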

3. Why is GPU memory bandwidth important?

Memory bandwidth determines how quickly the GPU can access and process data, affecting performance in the following areas:

  • Machine learning – faster model training and inference.
  • Gaming – smoother rendering, especially at high resolutions and refresh rates.
  • Video editing and 3D rendering – faster loading of textures and assets.
  • Scientific computing – improved data processing for simulations.

GPUs with high memory bandwidth provide smoother performance, especially when processing large datasets.

4. How does memory type affect bandwidth?

Different types of GPU memory provide different bandwidth capabilities:

  • GDDR6 – common in gaming and workstation GPUs, offers good bandwidth.
  • GDDR6X – used in high-end GPUs such as the RTX 3090, provides faster data transfer.
  • HBM (High Bandwidth Memory) – used in AI and data center GPUs (e.g. the AMD MI300X), provides much higher bandwidth thanks to a wider memory bus and stacked architecture.

5. How does memory bandwidth differ from memory size (VRAM)?

VRAM (memory size) determines how much data can be stored at once. GPUs with more VRAM can handle larger data sets. Memory bandwidth determines how fast data can be transferred. Even if a GPU has a lot of VRAM, it can become a bottleneck if its bandwidth is low. For games, VRAM is critical for high-resolution textures, while for AI/ML, memory bandwidth is often a much bigger performance factor.

6. Can GPU memory bandwidth be increased?

You cannot widen a GPU's physical memory bus, but you can make better use of the bandwidth you have: overclock the VRAM (where supported, and with caution), use algorithms optimized to reduce memory traffic, or move to high-bandwidth cloud GPUs such as DigitalOcean GPU Droplets, which offer H100 GPUs with excellent memory performance.

7. Which GPUs have the highest memory bandwidth?

Some of the highest bandwidth GPUs include:

  • NVIDIA H100 – 3.35 TB/s (HBM3 memory)
  • AMD Instinct MI300X – 5.3 TB/s (HBM3 memory)
  • NVIDIA A100 (80 GB) – about 2.0 TB/s (HBM2e memory)
  • NVIDIA RTX 4090 – 1.0 TB/s (GDDR6X memory)

These GPUs are designed for AI, machine learning, and high-performance computing.

8. How does memory bus width affect bandwidth?

The memory bus width is the number of bits that can be transferred per cycle. A wider bus allows more data to be processed simultaneously, increasing bandwidth. For example:

  • 128-bit bus (RTX 4060) → lower bandwidth.
  • 384-bit bus (RTX 4090) → higher bandwidth.
  • 4096-bit bus (HBM-powered GPU) → provides extreme bandwidth for AI workloads.

9. What role does memory speed (clock speed) play in bandwidth?

Memory clock speed determines how quickly data can be read and written. Higher speeds mean more data transfers per second and therefore more bandwidth, but memory type (e.g. GDDR6 vs. HBM) and bus width also factor into the total.

10. How does GPU memory bandwidth affect game performance?

Higher bandwidth allows textures to load faster and frame rates to stay smoother. This is critical in high-resolution (1440p/4K) games, where large amounts of texture data must be processed quickly, and it matters even more in open-world games (e.g. Cyberpunk 2077, Microsoft Flight Simulator).

11. Is memory bandwidth more important than VRAM for AI and machine learning?

For AI/ML workloads, memory bandwidth is often more important than VRAM size. AI models need to move data quickly between memory and the compute cores, so high bandwidth is critical for efficiency. For large models both bandwidth and VRAM matter, but GPUs like the NVIDIA H100 (3.35 TB/s of bandwidth) stand out because they can feed massively parallel computations efficiently. DigitalOcean GPU Droplets provide high-bandwidth GPUs for cloud-based AI workloads, enabling faster model training and inference without the limits of local hardware.

Summary

Understanding GPU memory bandwidth is crucial for optimizing machine learning workloads. It largely determines how quickly data moves between memory and the compute cores, which directly affects training speed, inference efficiency, and overall computing performance.

For AI teams that want strong performance and tight control over costs, DigitalOcean GPU Droplet servers are a good fit. With high-bandwidth GPUs such as the NVIDIA H100 and H200, you can train deep learning models efficiently, process large datasets, and scale workloads without a large upfront investment in hardware. By leveraging DigitalOcean's cloud-based GPU services, you can optimize costs while keeping your models running smoothly and efficiently.