
From H100, GH200 to GB200: How NVIDIA Builds AI Supercomputers with SuperPod

Jason
Data Center Architect · Sep 25, 2024 · High Performance Computing

As AI models grow larger, training on a single GPU is no longer feasible. The challenge now is interconnecting hundreds or even thousands of GPUs so that they act as a single, unified supercomputer, and this has become the industry's main focus. NVIDIA's DGX SuperPOD, the company's next-generation AI architecture for data centers, is designed to meet this challenge. It delivers the computational power needed for AI model training, inference, high-performance computing (HPC), and hybrid workloads, boosting predictive accuracy and reducing deployment time. In this article, we'll explore NVIDIA's three generations of GPU interconnect architectures: H100, GH200, and GB200.

 

Building a 256-GPU SuperPod Based on H100

In a DGX A100 setup, each node contains 8 GPUs connected via NVLink and NVSwitch within the server, while inter-node connectivity (across different servers) relies on 200Gbps InfiniBand (IB) HDR networking (note: this can also be replaced with RoCE networking).

 

With the DGX H100, NVIDIA extended NVLink beyond intra-node communication, introducing the NVLink-network Switch. Inside each node, NVSwitch handles internal GPU traffic, while the NVLink-network Switch manages inter-node communications. This setup allows for the creation of a SuperPod with up to 256 H100 GPUs. The bandwidth for reduction operations across 256 GPUs can still hit 450 GB/s, consistent with the bandwidth within a single server.

 

NVIDIA NVLink Scale-Up

 

However, the DGX H100 SuperPod isn’t without limitations. The connection between DGX H100 nodes is constrained by just 72 NVLink links, meaning the SuperPod's network is not fully non-blocking.

 

As shown in the diagram, within a DGX H100 system, the four NVSwitches leave 72 NVLink connections available for inter-node communication via NVLink-network Switches. These 72 NVLink connections offer a total bidirectional bandwidth of 3.6 TB/s, while the total bidirectional bandwidth of the 8 H100 GPUs is 7.2 TB/s, resulting in a 2:1 oversubscription (tapering) at the NVSwitch level.
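To make the tapering concrete, here is a quick back-of-the-envelope check in Python. The only figure assumed beyond the numbers above is the per-link rate of 50 GB/s bidirectional for NVLink 4.0, which follows from 900 GB/s spread across an H100's 18 links.

```python
# Back-of-the-envelope check of the DGX H100 SuperPod tapering.
# Assumes NVLink 4.0 at 50 GB/s bidirectional per link (900 GB/s / 18 links per H100).

NVLINK4_BW = 900 / 18        # GB/s (bidirectional) per NVLink 4.0 link
GPUS_PER_NODE = 8
LINKS_PER_GPU = 18
INTER_NODE_LINKS = 72        # NVLink ports left for the NVLink-network Switches

intra_node_bw = GPUS_PER_NODE * LINKS_PER_GPU * NVLINK4_BW   # 7200 GB/s = 7.2 TB/s
inter_node_bw = INTER_NODE_LINKS * NVLINK4_BW                # 3600 GB/s = 3.6 TB/s

print(f"intra-node: {intra_node_bw / 1000:.1f} TB/s, inter-node: {inter_node_bw / 1000:.1f} TB/s")
print(f"oversubscription: {intra_node_bw / inter_node_bw:.0f}:1")
```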

 

NVIDIA NVLink 4 NVSwitch New Features

 

256-GPU SuperPod Based on H100

 

Building a 256-GPU SuperPod Based on GH200 and GH200 NVL32

In 2023, NVIDIA announced the production launch of its generative AI engine, the DGX GH200. The GH200 pairs one Grace CPU with one H200 GPU (the main differences between the H200 and H100 are memory capacity and bandwidth). In addition to the NVLink 4.0 connections between GPUs, the GH200 links the CPU and GPU with NVLink-C2C, a 900 GB/s chip-to-chip interconnect.

 


 

With NVLink 4.0's 900 GB/s of bandwidth per GPU, the GH200 significantly raises inter-GPU communication throughput. Within a server, connections may be made over copper, while inter-server communication may rely on optical fiber. In a 256-GPU GH200 cluster, each GH200 corresponds to nine 800 Gbps optical modules, with each module carrying two NVLink 4.0 links for 100 GB/s.
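A quick sketch of that optics budget, using the same accounting (one 800 Gbps module counted as two NVLink 4.0 links, i.e. 100 GB/s of bidirectional NVLink traffic):

```python
# Rough optics count for a 256-GPU GH200 NVLink cluster, following the
# accounting above: one 800 Gbps module carries two NVLink 4.0 links,
# i.e. 2 x 50 GB/s = 100 GB/s of bidirectional NVLink traffic.

NVLINK_BW_PER_GPU = 900     # GB/s of NVLink 4.0 per GH200
BW_PER_MODULE = 2 * 50      # GB/s carried by one 800 Gbps optical module
GPUS = 256

modules_per_gpu = NVLINK_BW_PER_GPU // BW_PER_MODULE   # 9
total_modules = GPUS * modules_per_gpu                 # 2304

print(f"{modules_per_gpu} modules per GH200, {total_modules} modules in total")
```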

 

The key difference between the DGX GH200 SuperPod and the DGX H100 SuperPod is that both intra-node and inter-node connections use NVLink-network Switches.

 

A DGX GH200 node uses a two-tier Fat-tree architecture, with each node consisting of 8 GH200 GPUs and 3 NVLink-network Switches at the first tier. A fully connected 256-GPU SuperPod requires 36 second-tier NVLink-network Switches to ensure a fully non-blocking network.
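Tallying the switches implied by those figures (256 GPUs in 8-GPU nodes, 3 first-tier switches per node, 36 second-tier switches) gives the following rough count:

```python
# Switch tally for the two-tier Fat-tree of a 256-GPU DGX GH200 SuperPod,
# using the per-node figures quoted above.

GPUS = 256
GPUS_PER_NODE = 8
L1_PER_NODE = 3          # first-tier NVLink-network Switches per node
L2_SWITCHES = 36         # second-tier NVLink-network Switches

nodes = GPUS // GPUS_PER_NODE        # 32 nodes
l1_switches = nodes * L1_PER_NODE    # 96 first-tier switches

print(f"{nodes} nodes, {l1_switches} first-tier switches, {L2_SWITCHES} second-tier switches")
```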

 

256-GPU SuperPod Based on GH200

 

The GH200 NVL32, designed as a rack-level cluster, features 32 GH200 GPUs and 9 NVSwitch Trays (each tray contains two NVSwitch chips). Scaling GH200 NVL32 racks up to a 256-GPU SuperPod requires an additional set of 36 second-tier NVLink-network Switches.
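To see why 36 extra switches suffice, here is a sketch of the port budget. It assumes that each third-generation NVSwitch chip exposes 64 NVLink 4.0 ports, so a two-chip tray offers 128 ports; that chip-level port count is an assumption rather than a figure from the text.

```python
# Sketch of joining eight GH200 NVL32 racks into a 256-GPU SuperPod.
# Assumption: a third-generation NVSwitch chip exposes 64 NVLink 4.0 ports,
# so each two-chip NVSwitch Tray offers 128 ports (not stated in the article).

LINKS_PER_GPU = 18
GPUS_PER_RACK = 32
TRAYS_PER_RACK = 9
PORTS_PER_TRAY = 2 * 64
RACKS = 8

downlinks = GPUS_PER_RACK * LINKS_PER_GPU                   # 576 links down to the GPUs
uplinks = TRAYS_PER_RACK * PORTS_PER_TRAY - downlinks       # 576 ports left as uplinks per rack
second_tier_trays = RACKS * uplinks // PORTS_PER_TRAY       # 4608 / 128 = 36

print(f"uplinks per rack: {uplinks}, second-tier switches needed: {second_tier_trays}")
```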

GH200 NVL32 rack, front and back views

 

Building a 576-GPU SuperPod Based on GB200 NVL72

Unlike the GH200, the GB200 combines one Grace CPU and two Blackwell GPUs (note: each individual Blackwell GPU does not fully match the performance of a single B200 GPU). A GB200 Compute Tray is designed according to NVIDIA’s MGX architecture, with each tray containing two GB200 modules—equivalent to two Grace CPUs and four GPUs.

 

GB200 NVL72 rack

 

A GB200 NVL72 node consists of 18 GB200 Compute Trays (36 Grace CPUs and 72 GPUs), along with 9 NVLink-network Switch Trays. Each Blackwell GPU is equipped with 18 NVLink connections, while each NVLink-network Switch Tray features 144 NVLink ports. Therefore, 9 NVLink-network Switch Trays are required to establish full connectivity for the 72 GPUs.
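The port math works out exactly, which also explains why there are no spare NVLink ports left over for a second switch tier (a point revisited below). A minimal check:

```python
# Port accounting inside one GB200 NVL72 rack: 72 Blackwell GPUs with
# 18 NVLink links each, against 9 switch trays with 144 NVLink ports each.

GPUS = 72
LINKS_PER_GPU = 18
SWITCH_TRAYS = 9
PORTS_PER_TRAY = 144

gpu_links = GPUS * LINKS_PER_GPU               # 1296
switch_ports = SWITCH_TRAYS * PORTS_PER_TRAY   # 1296

print(f"GPU links: {gpu_links}, switch ports: {switch_ports}, spare ports: {switch_ports - gpu_links}")
```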

 

Internal Topology of GB200 NVL72

 

According to NVIDIA's official documentation, eight GB200 NVL72 units can form a SuperPod, resulting in a 576-GPU supercomputing node.

 

However, upon closer inspection, the 9 NVLink-network Switch Trays within a GB200 NVL72 node are entirely occupied connecting the 72 Blackwell GPUs, leaving no spare NVLink ports to scale into a larger two-tier switch architecture. Based on NVIDIA's official diagrams, the 576-GPU SuperPod likely achieves inter-node communication through a Scale-Out RDMA network rather than an NVLink-based Scale-Up fabric. To interconnect all 576 GPUs over NVLink, each group of 72 GPUs would need 18 first-tier NVSwitch Trays rather than 9, which would exceed the physical space of a single rack.

 

NVIDIA also states that the NVL72 is offered in both single-rack and dual-rack configurations. In the dual-rack version, each Compute Tray holds a single GB200 module rather than two. This dual-rack variant could potentially use NVLink interconnects to support the full 576-GPU SuperPod.

 

576-GPU SuperPod Based on GB200

 

The GB200 SuperPod, like the fully interconnected 256-GPU GH200 architecture, employs a two-tier NVLink-network Switch structure to support its 576 Blackwell GPUs. In the first tier, half of each switch's ports face the GPUs, so connecting all 576 GPUs requires 144 NVLink-network Switches. In the second tier, the remaining first-tier ports connect upward, requiring an additional 72 NVLink-network Switches to complete the network. This two-tier design ensures efficient GPU interconnectivity and scalability.
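A short calculation reproduces both switch counts from the half-down, half-up port split described above:

```python
# Two-tier NVLink switch count for a fully connected 576-GPU GB200 SuperPod,
# following the half-down / half-up port split described above.

GPUS = 576
LINKS_PER_GPU = 18
PORTS_PER_SWITCH = 144

gpu_links = GPUS * LINKS_PER_GPU                   # 10368 GPU-facing links
tier1 = gpu_links // (PORTS_PER_SWITCH // 2)       # half the ports face the GPUs -> 144 switches
uplinks = tier1 * (PORTS_PER_SWITCH // 2)          # 10368 uplinks toward the second tier
tier2 = uplinks // PORTS_PER_SWITCH                # 72 second-tier switches

print(f"first tier: {tier1} switches, second tier: {tier2} switches")
```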

 

Image source: Nvidia
