Typical 4*L40S/8*L40S Host
The L40S is a new-generation "cost-effective" multi-purpose GPU launched in 2023, positioned as an alternative to the A100. According to NVIDIA's official marketing, it can do almost everything except train large foundation models. As for price, current verbal quotes from third-party server vendors are roughly 20% below the A100.
1. L40S vs. A100 Configuration and Feature Comparison
One of the key advantages of the L40S is its short time-to-market: the cycle from ordering to delivery is much shorter than for the A100/A800/H800.
The L40S uses GDDR6 memory, so it does not depend on HBM supply (or advanced packaging). Its low price can be attributed to several factors:
- Most of the cost reduction may come from the GPU itself: certain modules and features might have been removed, or cheaper alternatives might have been used.
- There are also savings at the system level, such as the removal of a layer of PCIe Gen4 switches. Compared with the cost of the 4x/8x GPUs themselves, however, the rest of the system adds almost nothing.
2. L40S vs. A100 Performance Comparison
Summarizing NVIDIA's official nominal performance comparison:
- Performance: 1.2x ~ 2x that of the A100, depending on the specific workload.
- Power consumption: two L40S GPUs draw roughly the same power as a single A100.
NVIDIA officially recommends configuring L40S servers with four GPUs rather than eight, so a common comparison is two 4x L40S servers versus a single 8x A100 server. Furthermore, a 200Gbps RoCE or InfiniBand network is crucial for significant performance gains in many scenarios.
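As a back-of-the-envelope check on this two-versus-one comparison, the sketch below combines the nominal per-GPU ratios quoted above; it is illustrative arithmetic only, ignoring the interconnect effects discussed in the rest of this article.

```python
# Rough aggregate-throughput comparison: two 4x L40S servers vs one 8x A100 server.
# The 1.2x-2x per-GPU ratio is NVIDIA's nominal figure quoted above; this ignores
# interconnect bottlenecks, which later sections show can dominate.
a100_gpus = 8          # one 8x A100 server
l40s_gpus = 2 * 4      # two 4x L40S servers

for per_gpu_ratio in (1.2, 2.0):   # L40S performance relative to one A100
    aggregate = l40s_gpus * per_gpu_ratio / a100_gpus
    print(f"per-GPU ratio {per_gpu_ratio:.1f}x -> "
          f"{aggregate:.1f}x the throughput of the 8x A100 server")
```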
3. Building an L40S Host
Recommended architecture: 2-2-4
Compared with the A100's 2-2-4-6-8-8 architecture, the officially recommended L40S host uses a 2-2-4 architecture. The physical topology of one machine is as follows:
The most significant change is the removal of the PCIe switch chip between CPU and GPU: both the network cards and the GPUs connect directly to the CPU's PCIe Gen4 x16 (64GB/s) slots.
The configuration is:
- 2 CPUs (NUMA)
- 2 dual-port CX7 network cards (2*200Gbps per card)
- 4 L40S GPUs
- Additionally, one dual-port storage network card, directly attached to one of the CPUs.
This results in an average network bandwidth of 200Gbps per GPU.
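The arithmetic behind that per-GPU figure, as a quick sketch using only the port counts and speeds listed above:

```python
# Per-GPU network bandwidth in the recommended 2-2-4 L40S host.
# All figures are the ones stated in the configuration above.
compute_nics = 2        # dual-port CX7 cards (excluding the storage NIC)
ports_per_nic = 2
port_speed_gbps = 200   # per port
gpus = 4                # L40S GPUs per host

total_gbps = compute_nics * ports_per_nic * port_speed_gbps  # 800 Gbps
print(f"{total_gbps / gpus:.0f} Gbps of network bandwidth per GPU")  # 200 Gbps
```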
4. L40S Networking
NVIDIA officially recommends 4-card models with 200Gbps RoCE/IB networking.
5. L40S Data Link Bandwidth Bottleneck Analysis
Taking two L40S GPUs under the same CPU as an example, there are two connectivity options:
1. Direct CPU path: GPU0 <--PCIe--> CPU <--PCIe--> GPU1
- PCIe Gen4 x16 bidirectional 64GB/s, unidirectional 32GB/s.
- Whether CPU processing becomes a bottleneck: TODO (to be determined).
2. Bypassing the CPU: GPU0 <--PCIe--> NIC <--RoCE/IB Switch--> NIC <--PCIe--> GPU1
- PCIe Gen4 x16 bidirectional 64GB/s, unidirectional 32GB/s.
- On average, each GPU has a unidirectional 200Gbps network port, equivalent to 25GB/s unidirectional.
- This requires NCCL support; the latest NCCL versions are being adapted for the L40S, and the default behavior is to bypass the CPU in this way (see the bandwidth sketch after this list).
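To make the two options directly comparable, here is the unit conversion as a small sketch (all figures are the PCIe and NIC speeds stated above):

```python
# Unidirectional bandwidth of the two intra-host GPU-to-GPU paths.
# Pure unit conversion using the PCIe and NIC speeds stated above.
pcie_gen4_x16_gbyte = 32        # GB/s unidirectional, direct CPU path
nic_port_gbit = 200             # Gbps unidirectional per GPU, bypass-CPU path

nic_gbyte = nic_port_gbit / 8   # 200 Gbps == 25 GB/s
print(f"Option 1 (via CPU): {pcie_gen4_x16_gbyte} GB/s")
print(f"Option 2 (via NIC): {nic_gbyte:.0f} GB/s")
```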
The second option assumes that the network cards and switches are properly configured with a 200Gbps RoCE/IB network. In this network architecture (with sufficient network bandwidth):
The communication bandwidth and latency between any two GPUs are the same, regardless of whether they are in the same machine or under the same CPU, so the cluster can scale out horizontally (scale-out, rather than scaling up individual machines).
The cost of GPU machines goes down. However, for applications that do not actually need that much network bandwidth, the cost of NVLink is effectively shifted onto the network: a 200Gbps network must be built regardless to fully exploit the multi-GPU training performance of the L40S.
If the second option is chosen, the GPU-to-GPU bandwidth bottleneck within a host is the network card: even with the recommended 2x CX7 configuration, inter-GPU bandwidth is capped at the NIC's 200Gbps line rate.
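One way to verify this on real hardware is an intra-host all_reduce bandwidth probe. Below is a minimal sketch using PyTorch with the NCCL backend; NCCL_P2P_DISABLE and NCCL_SHM_DISABLE are standard NCCL environment variables that force traffic onto the network path (option 2 above), while the script name, message size, and timing loop are illustrative choices rather than a reference benchmark.

```python
# Minimal intra-host all_reduce bandwidth probe (sketch).
# Launch on one machine with: torchrun --nproc_per_node=2 probe.py
# To force the bypass-CPU network path (option 2), launch with:
#   NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 torchrun --nproc_per_node=2 probe.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# 256 MB of float32 data.
x = torch.ones(64 * 1024 * 1024, device="cuda")

for _ in range(5):                  # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg = (time.time() - t0) / iters

# For a 2-rank all_reduce, bytes moved per GPU ~= message size.
gbps = x.numel() * 4 / avg / 1e9
if rank == 0:
    print(f"avg all_reduce: {avg * 1e3:.2f} ms, ~{gbps:.1f} GB/s")

dist.destroy_process_group()
```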
Comparison of bandwidth between GPUs within the same host:
- L40S: 200Gbps (network card unidirectional line rate)
- A100: 300GB/s (NVLINK3 unidirectional) == 12x200Gbps
- A800: 200GB/s (NVLINK3 unidirectional) == 8x200Gbps
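Converting these to a common unit (a sketch of the arithmetic, using 200 Gbps == 25 GB/s):

```python
# Deriving the NVLink-vs-NIC bandwidth gaps from the unidirectional figures above.
l40s_gbyte = 200 / 8      # GB/s, NIC line rate per L40S GPU
a100_nvlink_gbyte = 300   # GB/s, NVLink3 unidirectional
a800_nvlink_gbyte = 200   # GB/s, capped NVLink3 unidirectional

print(f"A100 NVLink / L40S NIC: {a100_nvlink_gbyte / l40s_gbyte:.0f}x")  # 12x
print(f"A800 NVLink / L40S NIC: {a800_nvlink_gbyte / l40s_gbyte:.0f}x")  # 8x
```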
It can be observed that inter-GPU bandwidth on the L40S is 1/12 that of A100 NVLink and 1/8 that of A800 NVLink. The L40S is therefore not suitable for training large models that require intensive inter-GPU communication.
Related Resources:
High-Performance GPU Server Hardware Topology and Cluster Networking-1
High-Performance GPU Server Hardware Topology and Cluster Networking-2

