Since OpenAI introduced ChatGPT, Large Language Models (LLMs) have quickly become a focal point of interest and development. Companies are investing heavily in LLM pre-training to keep up with this trend. Training a 100B-scale LLM typically requires extensive computational resources, such as clusters with thousands of GPUs. For instance, the Falcon series models were trained on a cluster of 4096 A100 GPUs, taking nearly 70 days to train a 180B model with 3.5T tokens. As data scales continue to grow, so does the demand for computational power. Meta, for example, used 15T tokens to train its LLaMA3 series models on two clusters of 24K H100 GPUs.
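As a rough sanity check on these figures, here is a minimal back-of-envelope sketch. It uses the common 6 * N * D FLOPs approximation for dense transformers; the A100 peak throughput and the Model FLOPs Utilization (MFU) value are illustrative assumptions, not numbers taken from the runs cited above.

```python
# Back-of-envelope training-time estimate using the common ~6 * N * D FLOPs
# approximation (N = parameters, D = training tokens). Peak FLOPs and MFU
# below are assumed illustrative values, not measured figures.

def training_days(params, tokens, num_gpus, peak_flops_per_gpu, mfu=0.4):
    """Estimate wall-clock days to train a dense transformer."""
    total_flops = 6 * params * tokens                    # forward + backward
    cluster_flops = num_gpus * peak_flops_per_gpu * mfu  # sustained throughput
    return total_flops / cluster_flops / 86_400          # seconds -> days

# Falcon-180B-like run: 180B params, 3.5T tokens, 4096 A100s
# (assuming ~312 TFLOPS dense BF16 peak per A100).
print(f"{training_days(180e9, 3.5e12, 4096, 312e12):.0f} days")
# ~86 days at 40% MFU; the reported ~70 days would correspond to an MFU closer to 50%.
```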
This article delves into the components and configurations involved in building large-scale GPU clusters, including various GPU types, server configurations, and network equipment (such as network cards, switches, and optical transceiver modules).
Constructing a GPU cluster with over ten thousand cards is highly complex. This article covers a fraction of the considerations. In practice, cluster construction also involves the design of data center network topologies (like 3-Tier and Fat-Tree), storage networks, management networks, and other aspects, which have similar connection methods and are not elaborated on here.
GPU
The table below highlights the most powerful GPUs from the Ampere, Hopper, and the latest Blackwell series, showcasing improvements in memory, computing power, and NVLink capabilities.
From A100 to H100: FP16 dense computing power has increased more than threefold, while power consumption has only risen from 400W to 700W.
From H200 to B200: FP16 dense computing power has more than doubled, with power consumption increasing from 700W to 1000W.
B200 FP16 dense computing power: Approximately seven times that of the A100, with power consumption only 2.5 times higher.
Blackwell GPUs: Support FP4 precision, offering double the computing power of FP8. Note that NVIDIA's headline comparisons often pit Blackwell's FP4 against Hopper's FP8, which is part of why the reported acceleration looks so dramatic.
It is important to note that the GB200 uses the full-spec Blackwell chip, while the B100 and B200 are correspondingly cut-down versions.
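The ratios above can be sanity-checked with a minimal sketch; the FP16 dense TFLOPS and TDP figures below are approximate values commonly quoted for these parts and should be treated as assumptions rather than data from this article.

```python
# Approximate FP16 dense Tensor-Core throughput (TFLOPS) and TDP (W); assumed
# ballpark spec figures used only to reproduce the ratios quoted above.
specs = {
    "A100": {"fp16_tflops": 312,  "tdp_w": 400},
    "H100": {"fp16_tflops": 990,  "tdp_w": 700},
    "H200": {"fp16_tflops": 990,  "tdp_w": 700},
    "B200": {"fp16_tflops": 2250, "tdp_w": 1000},
}

def ratio(a, b, key):
    return specs[a][key] / specs[b][key]

print(f"H100 vs A100 compute: {ratio('H100', 'A100', 'fp16_tflops'):.1f}x")  # ~3.2x
print(f"B200 vs H200 compute: {ratio('B200', 'H200', 'fp16_tflops'):.1f}x")  # ~2.3x
print(f"B200 vs A100 compute: {ratio('B200', 'A100', 'fp16_tflops'):.1f}x")  # ~7.2x
print(f"B200 vs A100 power:   {ratio('B200', 'A100', 'tdp_w'):.1f}x")        # 2.5x
```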
HGX
HGX is NVIDIA's high-performance server platform, typically with 8 (or 4) GPUs per machine, paired with Intel or AMD CPUs and fully interconnected via NVLink and NVSwitch (8 GPUs is the NVLink full-interconnect limit outside of NVL and SuperPod configurations). These servers are generally air-cooled.
From HGX A100 to HGX H100 and HGX H200: FP16 dense computing power increased by 3.3 times, while power consumption less than doubled.
From HGX H100 and HGX H200 to HGX B100 and HGX B200: FP16 dense computing power has doubled, while power consumption rises by at most 50%.
Note that the networking on HGX B100 and HGX B200 has not been significantly upgraded: the back-end InfiniBand NICs are still 8x400 Gb/s.
NVIDIA DGX and HGX are two high-performance solutions designed for deep learning, artificial intelligence, and large-scale computing needs, but they differ in design and target applications:
DGX: A turnkey system aimed at end customers, offering a plug-and-play high-performance solution with comprehensive software support, including NVIDIA's deep learning software stack, drivers, and tools. These systems are pre-built and closed.
HGX: Mainly for cloud service providers and large-scale data center operators, suitable for building custom high-performance solutions. It offers a modular design, allowing customers to customize hardware based on their needs, typically provided as a hardware platform or reference architecture.
Networking
Network Cards
Here we primarily introduce Mellanox's ConnectX-5/6/7/8 high-speed network cards, which support both Ethernet and InfiniBand (IB). ConnectX-5 was released in 2016; after NVIDIA acquired Mellanox in 2019, ConnectX-6 followed in 2020 and ConnectX-7 in 2022, and ConnectX-8 was announced at GTC 2024, with detailed specifications yet to be published. Roughly speaking, each generation doubles the total bandwidth, and the next generation is expected to reach 1.6 Tbps.
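As a quick reference for that doubling trend, here is a minimal sketch; the per-generation maximum port speeds are approximate public figures and are assumptions on our part, since the detailed ConnectX-8 specification has not been published.

```python
# Approximate maximum total port bandwidth per ConnectX generation (assumed
# ballpark figures), illustrating the roughly 2x step per generation.
connectx_max_gbps = {
    "ConnectX-5": 100,
    "ConnectX-6": 200,
    "ConnectX-7": 400,
    "ConnectX-8": 800,  # next step expected to reach 1.6 Tb/s
}

for gen, gbps in connectx_max_gbps.items():
    print(f"{gen}: {gbps} Gb/s")
```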
Switches
NVIDIA provides switches for both Ethernet and InfiniBand, often with dozens or even hundreds of ports. The total throughput (bidirectional switching capacity) is the per-port bandwidth multiplied by the number of ports, times two to account for both directions.
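As a small worked example (a sketch using port counts from the tables that follow):

```python
# Bidirectional switching capacity = number of ports * per-port bandwidth * 2.
def bidirectional_capacity_tbps(num_ports: int, port_gbps: int) -> float:
    return num_ports * port_gbps * 2 / 1000  # Gb/s -> Tb/s

print(bidirectional_capacity_tbps(64, 800))  # SN5600: 64 x 800G -> 102.4 Tbps
print(bidirectional_capacity_tbps(40, 200))  # QM8700: 40 x 200G -> 16.0 Tbps
```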
The table below lists common Spectrum-X series Ethernet switches (only the higher-bandwidth port configurations are shown; lower speeds are also supported, but since the total number of physical ports is fixed, listing them adds little):
| Spectrum | SN3700C | SN3700 | SN4600C | SN4600 | SN4700 | SN5400 | SN5600 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Size | 1U | 1U | 2U | 2U | 1U | 2U | 2U |
| Throughput | 6.4 Tbps | 12.8 Tbps | 12.8 Tbps | 25.6 Tbps | 25.6 Tbps | 51.2 Tbps | 102.4 Tbps |
| Ports (800 Gbps) | - | - | - | - | - | - | 64 |
| Ports (400 Gbps) | - | - | - | - | 32 | 64 | 128 |
| Ports (200 Gbps) | - | 32 | - | 64 | 64 | 128 | 256 |
| Ports (100 Gbps) | 32 | 64 | 64 | 128 | 128 | 256 | 256 |
| Ports (50 Gbps) | 64 | 128 | 128 | 128 | 128 | 256 | 256 |
| Use Case | leaf | spine | spine, super-spine | leaf, spine | spine, super-spine | spine, super-spine | leaf, super-spine |
The table below shows common Quantum-X series InfiniBand switches:
| Quantum | QM8700/8790 | QM9700/9790 | X800 Q3400-RA |
| --- | --- | --- | --- |
| Size | 1U | 1U | 4U |
| Throughput | 16 Tbps | 51.2 Tbps | 230.4 Tbps |
| Ports (800 Gbps) | - | - | 144 |
| Ports (400 Gbps) | - | 64 | - |
| Ports (200 Gbps) | 40 | 128 | - |
| Ports (100 Gbps) | 80 | - | - |
| Topology | SlimFly, DragonFly+, 6DT | Fat Tree, SlimFly, DragonFly+, multi-dimensional Torus | Fat Tree, etc. |
In addition to Mellanox switches, many data centers use modular (chassis) switches. For instance, Meta's recent "Building Meta's GenAI Infrastructure" post describes two GPU clusters of 24K H100s each built with Arista 7800 series switches. The 7800 series includes modular switches such as the 7816LR3 and 7816R3, which offer up to 576 ports of 400G high-speed bandwidth, connected internally over a switching backplane with very low transmission and processing latency.
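To see why scale pushes designs toward high-radix or modular switches (or an extra switching tier), here is a minimal sketch using the standard non-blocking fat-tree capacity formulas. It assumes a 1:1 subscription ratio and a single network plane, ignoring the rail-optimized and oversubscribed designs used in practice.

```python
# Maximum endpoints of a non-blocking fat-tree built from radix-k switches:
# k^2 / 2 for a 2-tier (leaf-spine) fabric, k^3 / 4 for a 3-tier fabric.
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    if tiers == 2:
        return radix ** 2 // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only 2- and 3-tier fat-trees handled here")

for radix in (64, 128):
    print(radix, fat_tree_endpoints(radix, 2), fat_tree_endpoints(radix, 3))
# 64-port switches:  2,048 endpoints (2 tiers) or 65,536 (3 tiers)
# 128-port switches: 8,192 endpoints (2 tiers) or 524,288 (3 tiers)
```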
Optical Transceiver Modules
Optical transceiver modules enable fiber optic communication by converting electrical signals to optical signals and vice versa. This technology supports higher transmission rates and longer distances, free from electromagnetic interference. Each module typically includes a transmitter (electrical to optical conversion) and a receiver (optical to electrical conversion).
Common interfaces in fiber optic communication include SFP (Small Form-factor Pluggable) and QSFP (Quad Small Form-factor Pluggable):
SFP: A single transmission channel (one fiber or one fiber pair).
QSFP: Four transmission channels; the related QSFP-DD (Double Density) form factor extends this to 8 channels for higher port density, as the quick calculation below illustrates.
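A module's nominal bandwidth is simply the number of electrical lanes multiplied by the per-lane signaling rate, which is where the names in the table below come from. A minimal sketch (lane rates are nominal):

```python
# Module bandwidth = electrical lanes x per-lane rate (nominal values).
def module_gbps(lanes: int, lane_gbps: int) -> int:
    return lanes * lane_gbps

print(module_gbps(4, 50))   # QSFP56:     4 x 50G PAM4  = 200G
print(module_gbps(8, 50))   # QSFP56-DD:  8 x 50G PAM4  = 400G
print(module_gbps(8, 100))  # QSFP-DD800: 8 x 100G PAM4 = 800G
```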
| Interface | Channels | Model | Bandwidth | Year |
| --- | --- | --- | --- | --- |
| SFP | 1 | SFP | 1 Gbps | 2000 |
| SFP | 1 | SFP+ | 10 Gbps | 2006 |
| SFP | 1 | SFP28 | 25 Gbps | 2014 |
| SFP | 1 | SFP112 | 100 Gbps | 2021 |
| QSFP | 4 | QSFP+ | 40 Gbps | 2013 |
| QSFP | 4 | QSFP28 | 100 Gbps | 2016 |
| QSFP | 4 | QSFP56 | 200 Gbps | 2017 |
| QSFP | 4 | QSFP112 | 400 Gbps | 2021 |
| QSFP-DD | 8 | QSFP28-DD | 200 Gbps | 2016 |
| QSFP-DD | 8 | QSFP56-DD | 400 Gbps | 2018 |
| QSFP-DD | 8 | QSFP-DD800 | 800 Gbps | 2021 |
| QSFP-DD | 8 | QSFP-DD1600 | 1.6 Tbps | 2023 |
Recently, OSFP (Octal Small Form-factor Pluggable) packaging has been introduced for high-bandwidth scenarios such as 400 Gbps and 800 Gbps. OSFP modules are larger than QSFP-DD and require adapters for compatibility with SFP and QSFP interfaces. The table below shows 400 Gbps OSFP optical modules for different transmission distances (100 m, 500 m, 2 km, 10 km):
| Type | Maximum Transmission Distance | Wavelength | Fiber Type | Connector Type | Modulation Technology | Standards |
| --- | --- | --- | --- | --- | --- | --- |
| 400G OSFP SR8 | 100 m | 850 nm | Multimode fiber | MPO/MTP-16 | 50G PAM4 | IEEE P802.3cm / IEEE 802.3bs |
| 400G OSFP DR4 | 500 m | 1310 nm | Single-mode fiber | MPO/MTP-12 | 100G PAM4 | IEEE 802.3bs |
| 400G OSFP DR4+ | 2 km | 1310 nm | Single-mode fiber | MPO/MTP-12 | 100G PAM4 | / |
| 400G OSFP LR4 | 10 km | CWDM4 | Single-mode fiber | LC | 100G PAM4 | 100G Lambda Multisource Agreement |
Different optical transceiver modules are chosen for different distances and scenarios: for example, 10 km 400G LR4 or 800G 2xLR4 modules between the Core and Spine layers, 2 km 400G FR4 between the Spine and Leaf layers, and 500 m 400G DR4 between the Leaf and ToR layers.
The unit price of optical modules is relatively high, ranging from hundreds to thousands of dollars, depending on factors such as bandwidth and transmission distance. Generally, higher bandwidth and longer distances correlate with higher prices.
Explore more optical module solutions on NADDOD
NADDOD is a leading supplier in the optical transceiver industry, offering a comprehensive range of products from 1G to 800G, including advanced 800G/400G NDR solutions. NADDOD's portfolio supports both InfiniBand and RoCE (RDMA over Converged Ethernet) solutions, catering to various packaging forms such as SFP, QSFP, QSFP-DD, QSFP112, and OSFP. Compared to other vendors, NADDOD distinguishes itself with ready availability of all mainstream products, short delivery cycles, and rigorous 100% verification of all products before shipping. Additionally, NADDOD's timely and reliable service ensures efficient data transfer and communication within data centers.
Optical transceiver modules play a critical role in data center operations. Since each port requires an optical module, the number of optical modules is typically proportional to the number of GPUs, often reaching 4-6 times the number of GPUs. Adding optical modules can significantly enhance network performance and ensure efficient data transfer and communication within the data center.
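As a rough illustration of that 4-6x rule of thumb, here is a simplified sketch. It assumes one NIC port per GPU, a non-blocking fat-tree, optical links at every tier, and two transceivers per link; real deployments mix in DAC/ACC cables at short reach, multiple rails, and oversubscription, so actual counts vary.

```python
# Estimate optical modules per GPU in a fat-tree back-end network.
def transceivers_per_gpu(tiers: int, nic_side_optical: bool = True) -> int:
    inter_switch = 2 * (tiers - 1)           # 2 modules per switch-to-switch link level
    nic_link = 2 if nic_side_optical else 0  # NIC <-> leaf link, if optical
    return inter_switch + nic_link

print(transceivers_per_gpu(tiers=2))  # 2-tier fabric: ~4 modules per GPU
print(transceivers_per_gpu(tiers=3))  # 3-tier fabric: ~6 modules per GPU
```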
NADDOD's high-performance and compatibility features make it an ideal choice for data centers aiming to maintain optimal performance and reliability. Our extensive product range and prompt service ensure that data centers can achieve efficient and high-speed connectivity, crucial for large-scale GPU clusters and LLM training setups. Additionally, NADDOD offers cost-effective solutions, minimizing overall expenses while maintaining top-tier performance and reliability.
Conclusion
As the demand for computational power and efficient data transfer continues to rise, the importance of advanced networking solutions becomes increasingly evident. NADDOD's optical transceiver modules offer a robust solution to meet these demands, providing high-speed, reliable, and efficient communication within data centers. This supports the ever-growing needs of modern computational tasks and LLM training.

