NVIDIA’s DGX H100 NVL256 supercomputing cluster, initially planned to integrate 256 NVIDIA H100 GPUs, has been conspicuously absent from the commercial market. This absence has sparked widespread discussion about the reasons behind its discontinuation. The prevailing opinion is that the main obstacle was the disproportionate cost-benefit ratio. The extensive use of fiber optics for GPU interconnects significantly increased the Bill of Materials (BoM) costs, making it economically unfeasible compared to the standard NVL8 configuration.
DGX H100 NVL256 SuperPOD
NVIDIA claimed that the expanded NVL256 could provide up to twice the throughput for 400B-parameter MoE training. Major clients, however, were skeptical of this claim after running their own cost-benefit analyses. Although the then-upcoming NDR InfiniBand reached 400 Gbit/s while NVLink4 delivered 450 GB/s (roughly 3,600 Gbit/s, about a 9x peak bandwidth advantage), the system design, with 128 L1 NVSwitches and 36 L2 external NVSwitches, imposed a 2:1 blocking ratio: each server could use only half of its NVLink bandwidth when communicating with another server. NVIDIA relied on NVLink SHARP technology to optimize the network and achieve all-to-all bandwidth equivalence.
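As a rough sanity check on those figures, here is a minimal Python sketch using only the numbers quoted above (the per-port rates and the 2:1 blocking factor); it is illustrative arithmetic, not a model of the actual fabric:

```python
# Back-of-the-envelope comparison of per-GPU inter-server bandwidth,
# using only the figures quoted in the text.

NDR_IB_GBPS = 400                        # NDR InfiniBand per 400G port, Gbit/s
NVLINK4_GBYTES_PER_S = 450               # NVLink4 per GPU, GB/s
NVLINK4_GBPS = NVLINK4_GBYTES_PER_S * 8  # ~3600 Gbit/s

peak_ratio = NVLINK4_GBPS / NDR_IB_GBPS              # ~9x peak advantage
blocked_nvlink_gbps = NVLINK4_GBPS / 2               # 2:1 blocking between servers
effective_ratio = blocked_nvlink_gbps / NDR_IB_GBPS  # ~4.5x after blocking

print(f"Peak NVLink4 vs NDR IB advantage: {peak_ratio:.1f}x")
print(f"After the 2:1 blocking ratio:     {effective_ratio:.1f}x")
```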
H100 NVL256 Cost Analysis
At the Hot Chips 34 conference, an analysis of the H100 NVL256 BoM revealed that scaling to NVLink256 increased the BoM cost per scalable unit (SU) by about 30%. Expanding beyond 2,048 H100 GPUs also necessitated transitioning from a two-tier InfiniBand network topology to a three-tier one, slightly reducing the percentage increase in InfiniBand costs.
After conducting performance/total cost of ownership (perf/TCO) analyses, large-scale customers concluded that spending an additional 30% to buy more HGX H100 servers yielded a better performance/cost ratio than paying for the NVL256 NVLink extension. This led NVIDIA to ultimately decide against launching the DGX H100 NVL256 product.
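To make that perf/TCO reasoning concrete, here is a minimal break-even sketch; the near-linear scaling of extra HGX servers and the blended speedup values are illustrative assumptions, not figures from the customers' analyses:

```python
# Break-even sketch for the NVL256 premium (illustrative assumptions only).
# Option A: pay ~30% more per scale unit for the NVL256 NVLink extension.
# Option B: spend that same ~30% on additional plain HGX H100 servers,
#           assumed to scale near-linearly for the customer's workload mix.

COST_PREMIUM = 1.30  # NVL256 BoM premium from the Hot Chips analysis

def nvl256_wins(avg_speedup: float, scaling_efficiency: float = 1.0) -> bool:
    """NVL256 beats 'buy more servers' on perf/$ only if its average speedup
    across the workload mix exceeds what the same extra money buys in
    additional plain servers."""
    more_servers_perf = COST_PREMIUM * scaling_efficiency
    return avg_speedup > more_servers_perf

# NVIDIA's headline 2x (for 400B MoE training) would clear the bar...
print(nvl256_wins(2.0))   # True
# ...but a blended speedup of, say, 1.2x across a real workload mix does not.
print(nvl256_wins(1.2))   # False
```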
GH200 NVL32 Redesign
Subsequently, NVIDIA redesigned the NVL256, downsizing it to NVL32 and adopting a copper backplane spine similar to its NVL36/NVL72 Blackwell design. Reportedly, AWS agreed to purchase 16k GH200s in this NVL32 configuration for its Project Ceiba. The cost premium of the redesigned NVL32 is estimated at about 10% over the standard high-end HGX H100 BoM. With growing workloads, NVIDIA claims that at the 16k-GPU scale the GH200 NVL32 will train GPT-3 175B 1.7 times faster than 16k H100s and will be twice as fast for 500B-parameter LLM inference. These performance/cost ratios are more attractive to customers, increasing the likelihood of adoption.
GB200 NVL72 Breakthrough
NVIDIA, having learned from the H100 NVL256's failure, shifted to a copper interconnect called the “NVLink spine” to address the cost issue. This design change is expected to lower the cost of goods (COG) and pave the way for the GB200 NVL72's success. NVIDIA claims that by using copper, NVL72 interconnect costs are about six times lower, with each GB200 NVL72 rack saving approximately 20 kW and each GB200 NVL36 rack saving about 10 kW.
Unlike the H100 NVL256, the GB200 NVL72 uses no NVLink switches inside the compute nodes, opting instead for a flat, rail-optimized network topology. For every 72 GB200 GPUs there are 18 NVLink switches. All connections stay within the same rack and span only 19U (about 0.83 meters), well within the reach of passive copper cables.
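A quick port-count check shows why this works out to a flat, single-tier NVLink fabric. The 18 links per Blackwell GPU and 72 ports per NVLink switch chip are assumptions not stated in the article, so treat this as a sketch:

```python
# Rough port-count check for a single GB200 NVL72 rack.
# Assumptions (not from the article): 18 NVLink5 links per GB200 GPU,
# 72 ports per NVLink switch chip.

GPUS_PER_RACK = 72
LINKS_PER_GPU = 18        # assumed NVLink5 links per GPU
SWITCHES_PER_RACK = 18    # from the article
PORTS_PER_SWITCH = 72     # assumed ports per NVLink switch chip

gpu_side_links = GPUS_PER_RACK * LINKS_PER_GPU            # 1296
switch_side_ports = SWITCHES_PER_RACK * PORTS_PER_SWITCH  # 1296

# Every switch port is consumed by a GPU link, leaving no uplinks:
# within the rack, the NVLink fabric is a flat, single-tier, non-blocking domain.
assert gpu_side_links == switch_side_ports
print(f"GPU links: {gpu_side_links}, switch ports: {switch_side_ports}")
```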
According to SemiAnalysis, NVIDIA claims the design can scale to 576 GB200 GPUs in a single NVLink domain, potentially by adding an extra NVLink switch layer. NVIDIA is expected to maintain a 2:1 blocking ratio, using 144 L1 NVLink switches and 36 L2 NVLink switches within a GB200 NVL576 SU. Alternatively, it might adopt a more aggressive 4:1 blocking ratio using only 18 L2 NVLink switches. Optical OSFP transceivers would still be used to extend connections from the rack-internal L1 NVLink switches to the L2 NVLink switches.
Rumors suggest that NVL36 and NVL72 already account for over 20% of NVIDIA Blackwell deliveries. However, whether major customers will opt for the more costly NVL576 remains uncertain, as scaling to NVL576 requires additional optical devices. NVIDIA appears to have learned that copper interconnects are significantly less expensive than optical devices.
Perspectives on Copper vs. Optical Interconnects
NADDOD, a leader in optical networking solutions for HPC, data centers, and AI, concludes that copper interconnects will dominate at rack scale, maximizing Moore's-Law-style scaling before optics becomes necessary. NADDOD views the NVL72 favorably, seeing it as Moore's Law manifested at the rack level, and regards NVLink domains built on passive copper cables as a new benchmark of success: they offer a better cost-efficiency ratio and make GB200 NVL72 racks a more attractive investment than standalone B200 systems.
In short-range communication scenarios, copper interconnects offer significant advantages and play a crucial role in high-speed data center interconnects. Amid rising energy consumption and construction costs, copper provides superior cooling efficiency, lower power consumption, and better cost-effectiveness. As SerDes rates upgrade from 56G toward 224G, single-port speeds are expected to reach 1.6T, significantly reducing the cost of high-speed transmission, with copper cable rates advancing to 224 Gbps in step. Technologies such as AEC and ACC extend reach by embedding signal-conditioning chips, and copper cable and module manufacturing processes are upgrading in parallel. With their advantages in power consumption, stability, and cost-efficiency, copper interconnects are becoming increasingly important in high-performance computing and data center environments.
DAC’s Increasing Importance and Benefits
According to LightCounting, the global markets for passive direct attach cables (DAC) and active electrical cables (AEC) are projected to grow at compound annual growth rates of 25% and 45%, respectively. Between 2010 and 2022, switch chip bandwidth grew from 640 Gbps to 51.2 Tbps, an 80-fold increase, while total system power consumption rose 22-fold and the power drawn by optical components rose 26-fold.
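Treating those figures purely as arithmetic, a short sketch shows how the quoted growth factors relate; the energy-per-bit ratios are derived from the article's numbers, not stated by LightCounting:

```python
# Arithmetic on the 2010 -> 2022 figures quoted above.
bw_2010_gbps = 640
bw_2022_gbps = 51_200  # 51.2 Tbps

bw_growth = bw_2022_gbps / bw_2010_gbps  # 80x
system_power_growth = 22                 # total system power, per the article
optics_power_growth = 26                 # optical components, per the article

# Power grew far more slowly than bandwidth, so energy per bit fell,
# but optics' share of the (much larger) power budget keeps rising.
print(f"Bandwidth growth:          {bw_growth:.0f}x")
print(f"System energy/bit change:  {system_power_growth / bw_growth:.2f}x")
print(f"Optics energy/bit change:  {optics_power_growth / bw_growth:.2f}x")
```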
Using DAC copper cables eliminates the need for optical infrastructure, simplifying installation and use, reducing cost and complexity, and enabling rapid deployment. DACs are also energy-efficient: passive copper DACs consume under 0.1W and AECs under 5W, making them well suited to energy-efficient data centers.
Benefits of Using DACs in AI Network Clusters:
- Power Consumption: Compared to fiber optics, DAC copper cables run cooler and consume less power. For instance, a Quantum-2 IB backbone switch consumes 747W with DAC cables, while using multimode optical transceivers raises consumption to about 1,500W (see the rough comparison after this list).
- Stability: Copper cables offer higher stability than optical modules, reducing jitter and failures, which are common issues with fiber optic high-speed interconnects.
- Latency: DAC copper cables have lower transmission latency than optical modules, avoiding delays from electrical-to-optical and optical-to-electrical conversions.
- Fault Risk: DAC copper cables are more robust, reducing failure risks in high-density data center environments due to bending or physical damage.
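Using the per-cable figures above and the switch-level numbers from the first bullet, here is a rough power sketch; the 64-port count for the Quantum-2 switch is an assumption used only to derive a per-port estimate:

```python
# Rough power comparison per switch and per port, based on the figures above.
SWITCH_W_WITH_DAC = 747
SWITCH_W_WITH_MMF = 1500
PORTS_PER_SWITCH = 64        # assumed NDR 400G ports on a Quantum-2 switch

delta_per_switch = SWITCH_W_WITH_MMF - SWITCH_W_WITH_DAC  # ~753 W
delta_per_port = delta_per_switch / PORTS_PER_SWITCH      # ~11.8 W per port

PASSIVE_DAC_W = 0.1          # upper bound quoted above
AEC_W = 5.0                  # upper bound quoted above

print(f"Extra power per switch with optics: {delta_per_switch} W")
print(f"Extra power per 400G port:          {delta_per_port:.1f} W")
print(f"Passive DAC per cable:              <= {PASSIVE_DAC_W} W")
print(f"AEC per cable:                      <= {AEC_W} W")
```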
NADDOD's Networking Solutions - DAC + Single Mode Modules: The Most Cost-Effective Option
NADDOD offers three schemes for building a cluster of 128 HGX H100 servers, each equipped with 8 H100 GPUs and 8 400G network cards (1,024 GPUs and NICs in total), using a non-blocking two-layer fat-tree architecture. The options comprise one multimode solution and two single-mode solutions that incorporate DACs; a sizing sketch follows the three plan tables below. Here's a comparative cost analysis for each scheme:
- Plan 1: Multimode using 800G OSFP 2xSR4 and 400G OSFP SR4
| Equipment | NDD Model Number | NVIDIA Model Number | Quantity |
|---|---|---|---|
| Spine Switch Side Module | MMA4Z00-NS | | 512 |
| Leaf Switch Side Module | MMA4Z00-NS | | 1024 |
| Server Network Card Side Module | MMA4Z00-NS400 | | 1024 |
| Patch Cord | MFP7E10-Nxxx | | 2048 |
| Switch | MQM9790-NS2F | MQM9790-NS2F | 48 |
| Network Card | MCX75310AAS-NEAT | MCX75310AAS-NEAT | 1024 |
| 8-Card Server | / | / | 128 |
- Plan 2: Single Mode using 800G OSFP 2xDR4, 400G OSFP DR4, and 800G OSFP DAC
This plan involves placing the leaf and spine switches within a single rack, reducing the distance between the spine and leaf switches, and allowing for efficient copper cabling.
| Equipment | NDD Model Number | NVIDIA Model Number | Quantity |
|---|---|---|---|
| Spine Switch to Leaf Switch | MCP4Y10-N002 | | 512 |
| Leaf Switch Side Module | MMS4X00-NS | | 512 |
| Server Network Card Side Module | MMS4X00-NS400 | | 1024 |
| Patch Cord | / | | 1024 |
| Switch | MQM9790-NS2F | MQM9790-NS2F | 48 |
| Network Card | MCX75310AAS-NEAT | MCX75310AAS-NEAT | 1024 |
| 8-Card Server | / | / | 128 |
- Plan 3: Single Mode using 800G OSFP 2xFR4 and 800G OSFP Split DAC
In this configuration, the leaf switches are evenly distributed to minimize the distance between the leaf switches and servers, further enhancing cabling efficiency.
| Equipment | NDD Model Number | NVIDIA Model Number | Quantity |
|---|---|---|---|
| Spine Switch Side Module | MMS4X50-NM | | 512 |
| Leaf Switch Side Module | MMS4X50-NM | | 512 |
| Leaf Switch to Server Network Card | MCP7Y00-N003 | | 512 |
| Patch Cord | / | | 1024 |
| Switch | MQM9790-NS2F | MQM9790-NS2F | 48 |
| Network Card | MCX75310AAS-NEAT | MCX75310AAS-NEAT | 1024 |
| 8-Card Server | / | / | 128 |
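As a sanity check on the quantities in the three tables, here is a short sizing sketch of the non-blocking two-layer fat tree. The 64 x 400G ports per Quantum-2 switch and the 2 x 400G links per 800G twin-port module are assumptions, since the article does not spell them out:

```python
# Two-layer non-blocking fat-tree sizing for 128 HGX H100 servers.
# Assumptions: 64 x 400G ports per Quantum-2 switch; each 800G twin-port
# OSFP module (or 800G DAC) carries 2 x 400G links.

SERVERS = 128
NICS_PER_SERVER = 8              # 8 x 400G NICs per HGX H100 server (from the text)
SWITCH_PORTS = 64                # assumed 400G ports per switch

endpoints = SERVERS * NICS_PER_SERVER            # 1024 NICs / downlinks
leaf_down = SWITCH_PORTS // 2                    # 32 down + 32 up per leaf (non-blocking)
leaf_switches = endpoints // leaf_down           # 32
uplinks = leaf_switches * leaf_down              # 1024
spine_switches = uplinks // SWITCH_PORTS         # 16
total_switches = leaf_switches + spine_switches  # 48  (matches all three tables)

# Plan 1 (multimode) module counts:
spine_side_800g = uplinks // 2                   # 512  (matches the table)
leaf_side_800g = (endpoints + uplinks) // 2      # 1024 (matches the table)
nic_side_400g = endpoints                        # 1024 (matches the table)

# Plan 2: 512 x 800G DAC spine<->leaf (uplinks/2), 512 leaf-side 800G modules
#         (downlinks/2), 1024 x 400G NIC-side modules.
# Plan 3: 512 spine-side + 512 leaf-side 800G 2xFR4 modules (uplinks/2 each),
#         512 x 800G-to-2x400G split DACs (1024 NIC links / 2).
print(total_switches, spine_side_800g, leaf_side_800g, nic_side_400g)
```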
Cost Analysis:
- Traditional Multimode Plan (800G 2xSR4/400G SR4):
  - Requires 2,560 multimode optical modules.
  - Total cost: ~$3 million (NADDOD).
- Single Mode Plan (800G 2xDR4/400G DR4 + 800G DAC):
  - Requires 1,536 single-mode optical modules and 512 DAC cables.
  - Total cost: ~$1.9 million (NADDOD).
- Single Mode Plan (800G 2xFR4 + 800G Split DAC):
  - Requires 1,024 single-mode optical modules and 512 DAC cables.
  - Total cost: ~$2 million (NADDOD).
For a cluster of 128 HGX H100 servers, NADDOD's single-mode solutions offer substantial cost savings, reducing total connectivity costs by roughly 33%-36%. For larger configurations, such as a 512-server HGX H100 cluster, the 800G 2xDR4/400G DR4 single-mode plus 800G DAC solution offers even greater savings; NADDOD's approach saves 26% in costs compared to traditional multimode options.
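For completeness, a one-line check of the savings percentages implied by the approximate totals above (the inputs are the rounded figures from the cost analysis, so the outputs are approximate too):

```python
# Quick check of the quoted savings, using the approximate totals above (in $M).
multimode_cost = 3.0   # Plan 1, multimode
dr4_dac_cost = 1.9     # Plan 2, DR4 + 800G DAC
fr4_dac_cost = 2.0     # Plan 3, FR4 + 800G split DAC

for name, cost in [("Plan 2 (DR4 + DAC)", dr4_dac_cost),
                   ("Plan 3 (FR4 + split DAC)", fr4_dac_cost)]:
    saving = (multimode_cost - cost) / multimode_cost
    print(f"{name}: ~{saving:.1%} cheaper than multimode")
# -> roughly 36.7% and 33.3%, consistent with the 33%-36% range quoted above.
```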
For more information on cost-effective AI cluster connectivity solutions, contact our networking architecture experts at NADDOD.