What is the difference between H200 and H100?

Written by
Clara Bennett
Updated on: July 15, 2025

NVIDIA's H200 GPU performance revealed: the memory upgrade that transforms inference speed.

Core content:
1. How the H200 and H100 compare in compute specifications and power
2. How the H200's memory upgrade doubles inference speed
3. The H200's advantages in high-performance computing, and an introduction to the NVIDIA Grace Hopper architecture

You may have wondered about this: the H100 is currently the main volume part of NVIDIA's Hopper architecture, and at GTC this March, Jensen Huang already unveiled the next-generation Blackwell-architecture B100 GPU. Why launch the H200, on the same architecture, now?
In a nutshell: inference!
Compared with the H100, the H200's most important upgrade is its memory. Thanks to 141GB of HBM3e, the H200's memory bandwidth reaches 4.8TB/s, making its inference speed up to twice that of the H100.
Let's first look at the parameter comparison between the H200 and the H100:

The comparison makes it clear that the compute specifications of the H200 and H100 are identical, and even the power is the same; the only difference is in GPU memory. The H200 uses HBM3e where the H100 uses HBM3, which lifts the H200's memory capacity from 80GB to 141GB, nearly doubling it, and raises memory bandwidth from 3.35TB/s to 4.8TB/s, about 1.4x that of the H100.

In the fast-moving field of artificial intelligence, enterprises rely on large language models to serve a wide range of inference needs. When deploying inference servers at scale, they need the highest throughput at the lowest TCO. NVIDIA's official test report shows a large jump in inference performance: on the Llama 2 70B model, the H200 reaches twice the throughput of the H100.
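To see where that 2x comes from, it helps to run the numbers. The sketch below is a back-of-envelope, roofline-style estimate rather than an official benchmark: it assumes a 70B-parameter model stored at one byte per weight (e.g. FP8), batch size 1, and it ignores the KV cache; the bandwidth figures are the ones quoted above.

```cpp
// Back-of-envelope decode-throughput estimate (host-only C++; hypothetical
// assumptions: 70B weights at 1 byte/param, batch size 1, KV cache ignored).
#include <cstdio>

int main() {
    const double weight_bytes = 70e9;    // ~70 GB of weights
    const double bw_h100 = 3.35e12;      // H100: 3.35 TB/s
    const double bw_h200 = 4.8e12;       // H200: 4.8 TB/s

    // In the memory-bandwidth-bound regime, every generated token re-reads the
    // full weight set once, so tokens/s ~= bandwidth / weight_bytes.
    printf("H100: ~%.0f tokens/s\n", bw_h100 / weight_bytes);   // ~48
    printf("H200: ~%.0f tokens/s\n", bw_h200 / weight_bytes);   // ~69
    printf("ratio: %.2fx\n", bw_h200 / bw_h100);                // ~1.43x
    return 0;
}
```

Bandwidth alone buys roughly 1.4x; the rest of NVIDIA's measured 2x plausibly comes from capacity, since 141GB leaves room for larger batches and longer KV caches, amortizing each weight read over more tokens.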

Memory bandwidth is critical for high-performance computing (HPC) applications, speeding up data transfer and reducing bottlenecks in complex processing. For memory-intensive HPC applications such as simulation, scientific research, and artificial intelligence, the H200's higher memory bandwidth ensures data can be accessed and manipulated efficiently, with NVIDIA claiming up to 110x faster time to results compared with CPU-based systems.
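As a concrete illustration of measuring this, here is a minimal CUDA sketch (my own, not NVIDIA's test code) that probes achievable HBM bandwidth with a device-to-device copy; buffer size and iteration count are illustrative, and error handling is omitted. On an H100 the reported figure should approach the 3.35TB/s spec, and on an H200 the 4.8TB/s spec.

```cpp
// Minimal device-to-device HBM bandwidth probe. Compile with: nvcc probe.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ULL << 30;   // 1 GiB per buffer
    char *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 20;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy both reads and writes `bytes`, so traffic is 2 * bytes per pass.
    double gb_per_s = 2.0 * bytes * iters / (ms / 1000.0) / 1e9;
    printf("effective bandwidth: %.0f GB/s\n", gb_per_s);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```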

At the same time, NVIDIA also released the chip specifications of the GH200 (an H200 plus a Grace CPU). Let's take a look at the overall architecture.

The NVIDIA Grace Hopper architecture combines the groundbreaking performance of the NVIDIA Hopper GPU with the versatility of the NVIDIA Grace CPU in a single superchip, connected by the high-bandwidth, memory-coherent NVIDIA NVLink chip-to-chip (C2C) interconnect.

NVIDIA NVLink-C2C is a memory-coherent, high-bandwidth, low-latency interconnect for superchips. NVLink-C2C provides up to 900GB/s of total bandwidth between the CPU and GPU, roughly 7x that of the PCIe Gen5 x16 links commonly used in accelerated systems (a Gen5 x16 link delivers about 128GB/s of combined bidirectional bandwidth, and 900/128 ≈ 7). NVLink-C2C lets applications use the GPU's HBM and directly access the Grace CPU's memory at high bandwidth.

Each GH200 Grace Hopper superchip carries up to 480GB of LPDDR5X CPU memory. The GH200 can be deployed in standard servers to run a variety of inference, data-analytics, and other compute- and memory-intensive workloads. It can also be combined with the NVIDIA NVLink Switch System, allowing all GPU threads to address up to 256 NVLink-connected GPUs.

Grace CPU: the NVIDIA Grace CPU is currently the fastest Arm data-center CPU in the world. It is designed for high single-threaded performance, high memory bandwidth, and outstanding data-movement capability, combining 72 Neoverse V2 Armv9 cores with up to 480GB of server-grade LPDDR5X memory with ECC (error-correcting code). The design strikes the best balance among bandwidth, energy efficiency, capacity, and cost.

NVLink-C2C memory coherence: memory coherence improves developer productivity, performance, and the amount of memory the GPU can access. CPU and GPU threads can concurrently and transparently access both CPU- and GPU-resident memory, letting developers focus on algorithms rather than explicit memory management. Coherence also means developers can transfer only the data they need instead of migrating whole pages back and forth to the GPU, and it enables lightweight synchronization primitives between GPU and CPU threads through native atomic operations on both sides, as shown in the sketch below.
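To make the atomics point concrete, here is a small sketch (mine, not NVIDIA sample code) of a GPU kernel publishing a value to a spinning CPU thread through a system-scope atomic from libcu++. It assumes a platform where the CPU can touch managed memory while a kernel is running, such as a Grace Hopper system with NVLink-C2C coherence (or a Linux host with concurrent managed access).

```cpp
// CPU-GPU handshake through a system-scope atomic.
// Compile with: nvcc -arch=sm_90 handshake.cu
// Assumes concurrent CPU/GPU access to managed memory (true on Grace Hopper).
#include <cstdio>
#include <new>
#include <cuda/atomic>

using flag_t = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void producer(flag_t* flag, int* data) {
    *data = 42;        // write the payload first...
    flag->store(1);    // ...then publish it (sequentially consistent by default)
}

int main() {
    flag_t* flag = nullptr;
    int* data = nullptr;
    cudaMallocManaged(&flag, sizeof(flag_t));
    cudaMallocManaged(&data, sizeof(int));
    new (flag) flag_t(0);  // construct the atomic in managed memory

    producer<<<1, 1>>>(flag, data);

    // The CPU spins on the same atomic the GPU writes; once the flag flips,
    // the payload is visible -- no cudaDeviceSynchronize is needed for the
    // correctness of the read below.
    while (flag->load() == 0) { /* spin */ }
    printf("payload = %d\n", *data);  // 42

    cudaDeviceSynchronize();
    cudaFree(flag);
    cudaFree(data);
    return 0;
}
```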

For AI inference workloads, the GH200 Grace Hopper superchip combined with NVIDIA networking technology provides the best TCO (total cost of ownership) for scaling solutions, allowing customers to use up to 624GB of fast-access memory to handle larger data sets, more complex models, and new workloads.


NVIDIA also offers a dual GH200 configuration, with two Grace Hopper superchips fully connected via NVLink, providing 288GB of HBM3e and 1.2TB of fast memory for compute- and memory-intensive workloads.
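The memory figures in the last two paragraphs are easy to reconcile, with one assumption worth flagging: the GH200's HBM3e variant carries 144GB of HBM3e per superchip, slightly more than the standalone H200's 141GB. With that, the quick accounting below reproduces both the 624GB and the 1.2TB totals.

```cpp
// Fast-memory accounting for GH200 (assumes the 144GB HBM3e variant).
#include <cstdio>

int main() {
    const double hbm3e = 144.0;     // GB of HBM3e per superchip (assumed variant)
    const double lpddr5x = 480.0;   // GB of LPDDR5X per Grace CPU

    printf("single GH200: %.0f GB fast memory\n", hbm3e + lpddr5x);   // 624
    printf("dual GH200:   %.0f GB HBM3e\n", 2 * hbm3e);               // 288
    printf("dual GH200:   %.2f TB fast memory\n",
           2 * (hbm3e + lpddr5x) / 1000.0);  // 1.25, marketed as "1.2TB"
    return 0;
}
```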