Typical 4*L40S/8*L40S Host
The L40S is a new-generation "cost-effective" multi-purpose GPU launched in 2023, positioned as an alternative to the A100. According to NVIDIA's official marketing, it can do almost everything except train large foundation models. As for price, current verbal quotes from third-party server vendors are roughly 20% below the A100.
1. L40S vs. A100 Configuration and Feature Comparison
One of the key advantages of the L40S is its short time-to-market: the cycle from ordering to delivery is much shorter than for the A100/A800/H800.
The L40S uses GDDR6 memory, so it does not depend on HBM supply (or advanced packaging). Its low price can be attributed to several factors:
- Most of the cost reduction may come from the GPU itself: certain modules and features might have been removed, or cheaper alternatives might have been used.
- There are also savings at the system level, such as the removal of a layer of PCIe Gen4 switches. Compared with the cost of the 4x/8x GPUs themselves, however, the rest of the system adds almost nothing.
2. L40S vs. A100 Performance Comparison
Summarizing NVIDIA's official nominal performance comparison:
- Performance: 1.2x ~ 2x that of the A100, depending on the specific workload.
- Power consumption: two L40S GPUs draw roughly the same power as a single A100.
NVIDIA officially recommends configuring L40S servers with four GPUs rather than eight, so a common comparison is two 4x L40S servers versus a single 8x A100 server. Furthermore, a 200Gbps RoCE or InfiniBand network is crucial for significant performance gains in many scenarios.
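As a back-of-the-envelope check on this two-versus-one comparison, the sketch below combines the nominal per-GPU ratios quoted above; it is illustrative arithmetic only, ignoring the interconnect effects discussed in the rest of this article.

```python
# Rough aggregate-throughput comparison: two 4x L40S servers vs one 8x A100 server.
# The 1.2x-2x per-GPU ratio is NVIDIA's nominal figure quoted above; this ignores
# interconnect bottlenecks, which later sections show can dominate.
a100_gpus = 8          # one 8x A100 server
l40s_gpus = 2 * 4      # two 4x L40S servers

for per_gpu_ratio in (1.2, 2.0):   # L40S performance relative to one A100
    aggregate = l40s_gpus * per_gpu_ratio / a100_gpus
    print(f"per-GPU ratio {per_gpu_ratio:.1f}x -> "
          f"{aggregate:.1f}x the throughput of the 8x A100 server")
```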
3. Building an L40S Host
Recommended architecture: 2-2-4
Compared with the A100's 2-2-4-6-8-8 architecture, the officially recommended L40S host uses a 2-2-4 architecture. The physical topology of one machine is as follows:
The most significant change is the removal of the PCIe switch chip between CPU and GPU: both the network cards and the GPUs connect directly to the CPU's PCIe Gen4 x16 (64GB/s) slots.
The configuration is:
- 2 CPUs (NUMA)
- 2 dual-port CX7 network cards (2*200Gbps per card)
- 4 L40S GPUs
- Additionally, one dual-port storage network card, directly attached to one of the CPUs.
This results in an average network bandwidth of 200Gbps per GPU.
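The arithmetic behind that per-GPU figure, as a quick sketch using only the port counts and speeds listed above:

```python
# Per-GPU network bandwidth in the recommended 2-2-4 L40S host.
# All figures are the ones stated in the configuration above.
compute_nics = 2        # dual-port CX7 cards (excluding the storage NIC)
ports_per_nic = 2
port_speed_gbps = 200   # per port
gpus = 4                # L40S GPUs per host

total_gbps = compute_nics * ports_per_nic * port_speed_gbps  # 800 Gbps
print(f"{total_gbps / gpus:.0f} Gbps of network bandwidth per GPU")  # 200 Gbps
```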
4. L40S Networking
NVIDIA officially recommends 4-card models with 200Gbps RoCE/IB networking.
5. L40S Data Link Bandwidth Bottleneck Analysis
Taking two L40S GPUs under the same CPU as an example, there are two connectivity options:
1. Direct CPU path: GPU0 <--PCIe--> CPU <--PCIe--> GPU1
- PCIe Gen4 x16 bidirectional 64GB/s, unidirectional 32GB/s.
- Whether CPU processing becomes a bottleneck: TODO (to be determined).
2. Bypassing the CPU: GPU0 <--PCIe--> NIC <--RoCE/IB Switch--> NIC <--PCIe--> GPU1
- PCIe Gen4 x16 bidirectional 64GB/s, unidirectional 32GB/s.
- On average, each GPU has a unidirectional 200Gbps network port, equivalent to 25GB/s unidirectional.
- This requires NCCL support; the latest NCCL versions are being adapted for the L40S, and the default behavior is to bypass the CPU in this way (see the bandwidth sketch after this list).
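To make the two options directly comparable, here is the unit conversion as a small sketch (all figures are the PCIe and NIC speeds stated above):

```python
# Unidirectional bandwidth of the two intra-host GPU-to-GPU paths.
# Pure unit conversion using the PCIe and NIC speeds stated above.
pcie_gen4_x16_gbyte = 32        # GB/s unidirectional, direct CPU path
nic_port_gbit = 200             # Gbps unidirectional per GPU, bypass-CPU path

nic_gbyte = nic_port_gbit / 8   # 200 Gbps == 25 GB/s
print(f"Option 1 (via CPU): {pcie_gen4_x16_gbyte} GB/s")
print(f"Option 2 (via NIC): {nic_gbyte:.0f} GB/s")
```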
The second option assumes that the network cards and switches are properly configured with a 200Gbps RoCE/IB network. In this network architecture (with sufficient network bandwidth):
The communication bandwidth and latency between any two GPUs are the same, regardless of whether they are in the same machine or under the same CPU, so the cluster can scale out horizontally (scale-out, rather than scaling up individual machines).
The cost of GPU machines goes down. However, for applications that do not actually need that much network bandwidth, the cost of NVLink is effectively shifted onto the network: a 200Gbps network must be built regardless to fully exploit the multi-GPU training performance of the L40S.
If the second option is chosen, the GPU-to-GPU bandwidth bottleneck within a host is the network card: even with the recommended 2x CX7 configuration, inter-GPU bandwidth is capped at the NIC's 200Gbps line rate.
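One way to verify this on real hardware is an intra-host all_reduce bandwidth probe. Below is a minimal sketch using PyTorch with the NCCL backend; NCCL_P2P_DISABLE and NCCL_SHM_DISABLE are standard NCCL environment variables that force traffic onto the network path (option 2 above), while the script name, message size, and timing loop are illustrative choices rather than a reference benchmark.

```python
# Minimal intra-host all_reduce bandwidth probe (sketch).
# Launch on one machine with: torchrun --nproc_per_node=2 probe.py
# To force the bypass-CPU network path (option 2), launch with:
#   NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 torchrun --nproc_per_node=2 probe.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# 256 MB of float32 data.
x = torch.ones(64 * 1024 * 1024, device="cuda")

for _ in range(5):                  # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg = (time.time() - t0) / iters

# For a 2-rank all_reduce, bytes moved per GPU ~= message size.
gbps = x.numel() * 4 / avg / 1e9
if rank == 0:
    print(f"avg all_reduce: {avg * 1e3:.2f} ms, ~{gbps:.1f} GB/s")

dist.destroy_process_group()
```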
Comparison of bandwidth between GPUs within the same host:
- L40S: 200Gbps (network card unidirectional line rate)
- A100: 300GB/s (NVLINK3 unidirectional) == 12x200Gbps
- A800: 200GB/s (NVLINK3 unidirectional) == 8x200Gbps
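Converting these to a common unit (a sketch of the arithmetic, using 200 Gbps == 25 GB/s):

```python
# Deriving the NVLink-vs-NIC bandwidth gaps from the unidirectional figures above.
l40s_gbyte = 200 / 8      # GB/s, NIC line rate per L40S GPU
a100_nvlink_gbyte = 300   # GB/s, NVLink3 unidirectional
a800_nvlink_gbyte = 200   # GB/s, capped NVLink3 unidirectional

print(f"A100 NVLink / L40S NIC: {a100_nvlink_gbyte / l40s_gbyte:.0f}x")  # 12x
print(f"A800 NVLink / L40S NIC: {a800_nvlink_gbyte / l40s_gbyte:.0f}x")  # 8x
```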
It can be observed that inter-GPU bandwidth on the L40S is 1/12 that of A100 NVLink and 1/8 that of A800 NVLink. The L40S is therefore not suitable for training large models that require intensive inter-GPU communication.
Related Resources:
High-Performance GPU Server Hardware Topology and Cluster Networking-1
High-Performance GPU Server Hardware Topology and Cluster Networking-2

