
What’s the Ideal Switch for AI Workloads?

Neo
Switch Specialist · Sep 27, 2024 · Switches

In AI and HPC environments, traditional TCP/IP networks struggle with latency and CPU overhead. RDMA (Remote Direct Memory Access) has become a crucial alternative, offering high-throughput and low-latency communication without taxing the CPU. Technologies like InfiniBand, designed for hardware-level reliable transmission, and Ethernet-based solutions like RoCE and iWARP are now fundamental in modern AI infrastructures. This article covers network protocols, the role of switches in data center architecture, the relationship between NVIDIA switches and InfiniBand switches, the NVIDIA SuperPOD, and the latest trends in AI-driven network switches.

 

Network Protocols: The Foundation of Networking

Network protocols establish the rules for exchanging data between computing devices. The OSI (Open Systems Interconnection) model breaks this into seven layers, each governing a specific aspect of data communication, from physical transmission (Layer 1) to application services (Layer 7).

The OSI layers include (a short encapsulation sketch follows the list):

  • Physical Layer: Manages data transmission as electrical signals or light pulses, defining hardware interfaces and transmission rates.
  • Data Link Layer: Frames data for transmission over the physical link and provides error detection.
  • Network Layer: Handles logical (IP) addressing and routing to ensure data is delivered to the correct destination.
  • Transport Layer: Oversees data transmission quality, including error checking and retransmission when packets are lost.
  • Session Layer: Establishes, maintains, and terminates communication between networked devices.
  • Presentation Layer: Formats data for the application layer, handling encryption, decryption, and translation between data formats.
  • Application Layer: Interfaces with software applications to allow users access to network services like web browsers or email clients.
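
To make the layering concrete, the sketch below builds a frame from the inside out, wrapping an application payload (Layer 7) in TCP (Layer 4), IP (Layer 3), and Ethernet (Layer 2) headers. It uses the Scapy packet library purely as an illustration; Scapy is this sketch's assumption, not something the article prescribes.

```python
# Encapsulation across the lower OSI layers, illustrated with the Scapy
# packet library (assumed available; install with: pip install scapy).
from scapy.all import Ether, IP, TCP, Raw

# Build a frame from the inside out: an application payload (Layer 7)
# wrapped in TCP (Layer 4), IP (Layer 3), and Ethernet (Layer 2) headers.
payload = Raw(load=b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
frame = Ether(dst="ff:ff:ff:ff:ff:ff") / IP(dst="192.0.2.1") / TCP(dport=80) / payload

frame.show()        # dump each layer's fields, outermost header first
print(len(frame))   # total bytes on the wire once every header is added
```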

 

The seven layers of the OSI model (Source: Cloudflare)

 

In contrast, TCP/IP, the simplified four-layer protocol stack that underpins the internet, cannot meet the demands of modern AI networks because of its latency and heavy dependence on the CPU.

 

As AI requires real-time processing, TCP/IP's limitations become clear:

  • Latency: Frequent context switching introduces delays of up to several microseconds.
  • CPU Overhead: CPU cycles are consumed during packet handling, making it inefficient for large-scale AI operations.

 

Thus, AI has shifted toward RDMA, which moves data directly from one machine's memory to another's via the network interface card, bypassing the CPU. RDMA ensures high throughput and low latency, making it ideal for parallel computing environments like AI clusters. There are several key RDMA technologies (a conceptual sketch of the RDMA flow follows the list):

  • InfiniBand: A high-performance RDMA solution designed for reliable, low-latency transmission but with high costs.
  • RoCE (RDMA over Converged Ethernet) and iWARP: RDMA solutions that use Ethernet as a base, offering a more cost-effective alternative to InfiniBand.
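
The sketch below illustrates the one-sided "verbs" flow these RDMA technologies share: memory is registered with the NIC once, and subsequent writes land directly in a remote host's memory without involving its CPU. All class and method names here are illustrative placeholders, not a real API; production code would use libibverbs (C) or its Python wrapper, pyverbs.

```python
# Conceptual sketch of the one-sided RDMA "verbs" flow. All names are
# illustrative placeholders, not a real binding; real applications use
# libibverbs or pyverbs from the rdma-core project.

class RegisteredMemory:
    """Stands in for a memory region pinned and registered with the NIC."""
    def __init__(self, buf: bytearray):
        self.buf = buf        # the NIC can DMA directly into/out of this buffer
        self.rkey = 0x1234    # key a remote peer would present to access it

class QueuePair:
    """Stands in for an RDMA queue pair connecting two endpoints."""
    def post_rdma_write(self, local: RegisteredMemory, remote_addr: int, remote_rkey: int):
        # In real RDMA, the NIC moves local.buf straight into the remote
        # host's memory: no remote CPU involvement, no socket buffer copies.
        print(f"NIC writes {len(local.buf)} bytes to 0x{remote_addr:x} "
              f"(rkey=0x{remote_rkey:x})")

# Typical flow: register memory once, exchange addresses/keys out of band,
# then post writes as often as needed.
mr = RegisteredMemory(bytearray(b"gradient shard"))
qp = QueuePair()
qp.post_rdma_write(mr, remote_addr=0x7F0000000000, remote_rkey=0x5678)
```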

 

NADDOD offers a complete optical interconnect portfolio, including both InfiniBand and RoCE solutions, for AI/ML networks, providing reliable, high-speed connectivity for large-scale clusters.

 

The Role of Switches in Data Centers

Switches and routers play different roles in network management. Switches operate at the data link layer (Layer 2) and forward frames based on MAC addresses, while routers function at the network layer (Layer 3), connecting different networks using IP addresses.
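
As a toy illustration of that Layer 2 behavior (not any particular switch's implementation), the sketch below shows MAC learning: the switch records which port each source MAC address arrived on, forwards frames whose destination it has already learned, and floods the rest.

```python
# A toy model of Layer 2 forwarding with MAC learning (illustrative only).

mac_table: dict[str, int] = {}   # MAC address -> switch port

def handle_frame(src_mac: str, dst_mac: str, in_port: int, num_ports: int) -> list[int]:
    mac_table[src_mac] = in_port          # learn/refresh where the sender lives
    if dst_mac in mac_table:              # known destination: forward out one port
        return [mac_table[dst_mac]]
    # Unknown destination: flood out every port except the one it came in on.
    return [p for p in range(num_ports) if p != in_port]

print(handle_frame("aa:aa", "bb:bb", in_port=1, num_ports=4))  # flood: [0, 2, 3]
print(handle_frame("bb:bb", "aa:aa", in_port=2, num_ports=4))  # learned: [1]
```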

 

Traditional data centers have long relied on a three-tier architecture, consisting of access, aggregation, and core layers. In smaller data centers, the aggregation layer may be skipped, with Top of Rack (TOR) switches connecting directly to servers. Core switches manage the overall traffic flow, but as cloud computing and AI grow, the three-tier model exposes several limitations:

  • Bandwidth inefficiency due to VLAN limitations at the aggregation layer.
  • Large fault domains, where topology changes can disrupt network stability.
  • Increased latency as east-west traffic between servers grows, placing additional pressure on core and aggregation switches.

 

With the rise of AI workloads, many data centers have shifted to the leaf-spine architecture, which flattens the network into two layers: leaf and spine. Leaf switches connect directly to servers, acting as the access layer, while spine switches handle the interconnection of leaf switches, taking over the role of the core.
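
A rough way to see how far a two-tier fabric scales is to count ports. The sketch below sizes a non-blocking leaf-spine network assuming hypothetical 64-port switches with half of each leaf's ports facing servers; the numbers are illustrative round figures, not any specific product's.

```python
# Sizing a non-blocking leaf-spine fabric by counting ports.
# The 64-port switches are hypothetical round numbers.

leaf_ports = 64
spine_ports = 64

downlinks = leaf_ports // 2          # half of each leaf faces servers
uplinks = leaf_ports - downlinks     # the other half faces spines (1:1, non-blocking)

max_spines = uplinks                 # one link from every leaf to every spine
max_leaves = spine_ports             # each spine must reach every leaf
max_servers = max_leaves * downlinks

print(f"Up to {max_leaves} leaves x {downlinks} servers each = "
      f"{max_servers} servers, with {max_spines} spines, fully non-blocking")
```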

 

In this architecture, Equal Cost Multi-Path (ECMP) routing is used to dynamically select the most efficient data paths, preventing bottlenecks and ensuring high performance. Each leaf switch is connected to every spine switch, providing redundancy. In the event of a spine failure, overall throughput is only slightly reduced, making this design both robust and scalable. This non-blocking fabric is critical for AI workloads, where low latency and high throughput are essential.
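
At its core, ECMP is a deterministic hash over a flow's 5-tuple: every packet of a flow follows the same path (avoiding reordering) while different flows spread across the spines. The sketch below shows only that selection logic; real switches compute the hash in hardware, and CRC32 here is just a stand-in.

```python
# ECMP path selection in miniature: hash the flow's 5-tuple and pick one of
# the equal-cost uplinks. Real switches use hardware hash functions; CRC32
# is only a stand-in for this sketch.
import zlib

def ecmp_pick(src_ip: str, dst_ip: str, proto: int,
              src_port: int, dst_port: int, num_paths: int) -> int:
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_paths   # stable path index for this flow

# Two RoCEv2 flows (UDP, destination port 4791) between the same hosts can
# land on different spines because their source ports differ.
print(ecmp_pick("10.0.0.1", "10.0.1.1", 17, 49152, 4791, num_paths=4))
print(ecmp_pick("10.0.0.1", "10.0.1.1", 17, 49153, 4791, num_paths=4))
```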

 

With AI driving exponential growth in data processing, the demand for high-speed, low-latency switches has never been greater.

 

NVIDIA Switches vs. InfiniBand Switches

NVIDIA offers both Ethernet (Spectrum series) and InfiniBand (Quantum series) switches. InfiniBand, originally developed by Mellanox (acquired by NVIDIA in 2020), remains a dominant choice in AI clusters thanks to its low-latency, high-performance interconnects. On the Ethernet side, NVIDIA's Spectrum-4 (released in 2022) is a 400G switch aimed at accelerated Ethernet data center workloads, offering the most advanced feature set in the Spectrum line to date.

 

Spectrum-X, NVIDIA's Ethernet platform tailored for AI, extends traditional Ethernet switching. Its key components are Spectrum-4 Ethernet switches working in tandem with BlueField-3 DPUs. It supports RoCE for AI applications and adaptive routing, and delivers up to 95% effective bandwidth. Spectrum-X also ensures performance isolation in multi-tenant environments and handles hardware failures without compromising performance. The BlueField-3 integration further optimizes NCCL (NVIDIA Collective Communication Library) traffic, keeping performance stable across AI tasks, which is critical for meeting SLAs.

 

Choosing between InfiniBand and Ethernet (RoCEv2) depends on the application. InfiniBand excels in large-scale computing scenarios but has a smaller overall market share than Ethernet: according to statistics from ISC 2021, InfiniBand connected 70% of the top 10 supercomputers but only 65% of the top 100. For AI workloads, NVIDIA broadly divides application scenarios into the AI Cloud and the AI Factory: traditional Ethernet and Spectrum-X serve AI clouds, while NVLink combined with InfiniBand is essential in AI factories.

 

NVIDIA SuperPOD

The NVIDIA SuperPOD is a high-performance server cluster designed to deliver massive aggregate throughput by connecting multiple compute nodes. A notable example is the DGX A100 SuperPOD, which uses QM8790 switches (40x 200G ports) in a non-blocking fat-tree architecture. Each SU (Scalable Unit) comprises 20 servers, and each DGX A100 server connects to all 8 leaf switches in its SU. Those 8 leaf switches connect in turn to 5 spine switches per SU, keeping data paths short and latency low. As the number of SUs increases, additional spine switches are added for scalability.
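
The port arithmetic below checks these figures as stated in this article (consult NVIDIA's reference architecture for the authoritative numbers): with 20 servers per SU and one HCA link from each server to each of the 8 leaf switches, every 40-port leaf splits evenly into 20 downlinks and 20 uplinks, exactly the 1:1 ratio a non-blocking fat tree requires.

```python
# Checking the DGX A100 SuperPOD figures as stated above.
servers_per_su = 20
links_per_server = 8        # one HCA link from each server to each leaf
leaves_per_su = 8
leaf_ports = 40             # 40x 200G per leaf switch

server_links = servers_per_su * links_per_server     # 160 links per SU
down_per_leaf = server_links // leaves_per_su        # 20 server-facing ports
up_per_leaf = leaf_ports - down_per_leaf             # 20 ports left for spines

print(f"Per leaf: {down_per_leaf} downlinks + {up_per_leaf} uplinks "
      f"-> 1:1 ratio, i.e. the non-blocking fat tree the design calls for")
```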

 

DGX A100 SuperPOD architecture (Source: NVIDIA)

 

For larger configurations like the DGX H100 SuperPOD, each SU consists of 31 servers, and each QM9700 switch provides 64x 400G ports. The DGX H100 SuperPOD also leverages NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), which constructs Streaming Aggregation Trees (SATs) within the network topology so that reduction operations are performed inside the switches themselves. This in-network, parallel processing across multiple switches reduces latency and boosts performance significantly. The QM9700 can support up to 64 SATs, far surpassing the 2 SATs supported by older models like the QM8700.
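
The benefit of aggregation trees is easiest to see in miniature. The sketch below is a conceptual illustration of tree-based reduction only, not NVIDIA's SHARP implementation: combining partial results pairwise up a tree takes log2(N) sequential steps instead of N.

```python
# Conceptual illustration of tree-based aggregation (the idea behind
# in-network aggregation trees), not NVIDIA's implementation.
import math

def tree_reduce(values: list[int]) -> tuple[int, int]:
    steps = 0
    while len(values) > 1:
        # Each tree level sums adjacent pairs; all pairs at a level are
        # independent, so a whole level counts as one parallel step.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_reduce(list(range(32)))    # 32 contributing nodes
print(total, steps)                            # 496 in 5 steps
print(math.ceil(math.log2(32)))                # matches log2(32) = 5
```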

 

NADDOD 51.2T Switches for the Growing Data Center Market

The data center switch market is growing robustly, driven by rising demand for high-speed AI networks. According to IDC, data center switch revenues increased 7.6% year over year in Q2 2024, with 200/400 GbE switches leading the charge at 104.3% annual growth.

 

As the need for faster, higher-density networking intensifies, NADDOD's N9500 series RoCE switches (N9500-128QC and N9500-64OC) stand out for high-performance AI/ML data centers. Powered by Broadcom's Tomahawk 5, they support up to 51.2 Tbps of switching capacity, deliver superior scalability in Layer 2 networks, and offer cost savings of roughly 50% or more over InfiniBand solutions. With fast deployment capabilities and a comprehensive optical interconnect portfolio, NADDOD ensures reliable, future-proof network infrastructure for AI workloads.
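
The 51.2 Tbps figure is straightforward to sanity-check. The port counts and speeds below are inferred from the model names (128QC suggesting 128x 400G, 64OC suggesting 64x 800G) and are assumptions for illustration, not vendor specifications.

```python
# Sanity check of the 51.2 Tbps figure. Port counts and speeds are inferred
# from the model names and are assumptions, not vendor specifications.
configs = {"N9500-128QC": (128, 400), "N9500-64OC": (64, 800)}
for model, (ports, gbps) in configs.items():
    print(f"{model}: {ports} x {gbps}G = {ports * gbps / 1000:.1f} Tbps")
# Both configurations work out to 51.2 Tbps, matching the stated
# Tomahawk 5 switching capacity.
```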

 
