In distributed supercomputing data centers, achieving a lossless network requires integrating advanced technologies across both the IP network and optical transport layers. These solutions address demands for long-distance, ultra-high bandwidth, high reliability, agile scalability, and intelligent network operations. The primary technical elements include:
Key Technologies in Lossless Networking for Distributed SuperComputing Data Centers
Optimizing Collective Communication for Heterogeneous Networks
In heterogeneous networks with asymmetric bandwidth and latency, particularly across long-distance links, collective communication algorithms help adjust traffic flow, reducing congestion. In homogeneous networks, traffic symmetry enables each node to handle an equal bandwidth load. However, in heterogeneous settings, adjustments are essential for optimizing flow, such as minimizing data transmission on long-distance links, thus lowering congestion risks.
Supercomputing communication patterns are typically collective, with common methods including AllGather and AllReduce operations:
Collective Communication Operations
- AllGather: Each server sends its own unique chunk of data to all others, so every server ends up holding the complete data set.
- AllReduce: Each server contributes data of equal size, which is combined elementwise (for example by summation, maximum, or averaging) so that every server receives the same reduced result.
Common algorithms include the Ring and Halving-Doubling (HD) methods. Ring is simpler, requiring only neighbor-to-neighbor communication, while HD is more complex but more efficient, reducing the impact of latency, especially for small data payloads. Both algorithms assume a homogeneous network in which every node exhibits identical transmission and reception characteristics.
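The latency/bandwidth trade-off between the two algorithms can be sketched with a standard alpha-beta cost model. This is a simplified illustration, not a vendor benchmark; the payload size, GPU count, link latency, and bandwidth figures below are assumptions chosen for the example.

```python
import math

def ring_allreduce_time(S, N, alpha, B):
    # Ring AllReduce: 2*(N-1) neighbor steps, each moving S/N bytes,
    # so the per-step latency term grows linearly with the node count N.
    return 2 * (N - 1) * (alpha + (S / N) / B)

def hd_allreduce_time(S, N, alpha, B):
    # Halving-Doubling: only 2*log2(N) steps carry the latency cost,
    # while the total bytes moved per node stay at 2*(N-1)/N * S.
    return 2 * math.log2(N) * alpha + 2 * (N - 1) / N * (S / B)

# Assumed example: 1 MB payload, 16 GPUs, 5 us link latency, 50 GB/s links.
S, N, alpha, B = 1e6, 16, 5e-6, 50e9
print(ring_allreduce_time(S, N, alpha, B) > hd_allreduce_time(S, N, alpha, B))  # True
```

For this small payload HD wins because it pays the per-step latency only log2(N) times, matching the observation above that HD reduces latency impact for small data.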
In long-distance, distributed environments, networks are often heterogeneous, where GPU communication latency across distances is significantly higher than within a single data center. This renders traditional algorithms, like Ring and Halving-Doubling (HD), suboptimal. The table below compares these algorithms’ performance across long distances, where S denotes communication data size, and N represents the GPU count involved in collective communication.
Performance Evaluation of Typical Collective Communication Algorithms over Long Distances
To optimize communication in long-distance heterogeneous networks, a new framework customizes collective operations as follows:
Step 1: Treat each distributed data center (DC) as an independent subsystem, performing local collective operations with algorithms like Ring or HD.
Step 2: After local synchronization, designated representative servers (fewer than N/2) handle inter-DC synchronization. Each representative transmits S/K data, where K is the number of selected representatives. This step results in K-point bidirectional communication across the network.
Afterward, each representative aggregates the received data locally and then distributes the result within its own DC. This framework enables efficient AllReduce operations across DCs by minimizing inter-DC traffic: only the required data volume S is sent across the long-distance links.
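The steps above can be sketched in miniature. This is a toy model with plain Python lists standing in for GPU buffers; the summation reduce, the DC layout, and the helper name are illustrative assumptions, not an actual collective library API.

```python
def hierarchical_allreduce(dcs):
    """Toy sketch of the two-level AllReduce framework described above.

    dcs: list of data centers, each a list of per-GPU vectors.
    """
    # Step 1: local AllReduce inside each DC (summation shown).
    local = [[sum(col) for col in zip(*dc)] for dc in dcs]

    # Step 2: representatives exchange their shards across DCs and
    # aggregate; modeled here as one global sum over the local results.
    global_sum = [sum(col) for col in zip(*local)]

    # Step 3: each representative broadcasts the result within its DC.
    return [[list(global_sum) for _ in dc] for dc in dcs]

# Two DCs with two GPUs each; every GPU ends with the global sum [10, 10].
result = hierarchical_allreduce([[[1, 1], [2, 2]], [[3, 3], [4, 4]]])
```

Only the aggregated local results cross the long-distance link, which is what keeps the inter-DC traffic at the minimum size S.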
Long-Distance Collective Communication Algorithm Architecture
A simulation (image below) of an AllReduce operation with S = 1 GB over a 100 km span shows that the new algorithm outperforms the traditional Ring algorithm, with gains growing from 5% to 60% as the system scales. With only one cross-distance communication step and optimized data transfer, this approach achieves near-theoretical optimality.
Performance Simulation of the New Algorithm
In real-world deployments, integrating network equipment to broadcast topology information is essential. Devices continuously monitor link distance at the link layer, constructing and maintaining a topology map. This map is distributed to each server’s collective communication library via a controller. During each collective operation, the library uses this map and search algorithms to find the most efficient communication path based on source and destination distances.
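Given such a topology map, the library's path search reduces to a shortest-path computation weighted by measured link distance. The sketch below uses Dijkstra's algorithm over a hypothetical map; the node names and distances are invented for illustration and are not from any real deployment.

```python
import heapq

def shortest_path(topo, src, dst):
    # topo: {node: {neighbor: measured link distance in km}}
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in topo.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]

# Hypothetical map: two DCI links between sites, 100 km vs 120 km.
topo = {
    "gpu_a": {"leaf_a": 0.1},
    "leaf_a": {"gpu_a": 0.1, "dci_1": 0.5, "dci_2": 0.5},
    "dci_1": {"leaf_a": 0.5, "leaf_b": 100.0},
    "dci_2": {"leaf_a": 0.5, "leaf_b": 120.0},
    "leaf_b": {"dci_1": 100.0, "dci_2": 120.0, "gpu_b": 0.1},
    "gpu_b": {"leaf_b": 0.1},
}
path, km = shortest_path(topo, "gpu_a", "gpu_b")  # routes via the 100 km link
```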
Network Load Balancing Technology
Network load balancing mitigates congestion and packet loss in homogeneous, fault-free networks within supercomputing environments. These networks rely on synchronized, high-volume, and cyclical data flows for collective communication. "Homogeneous" indicates uniform bandwidth and latency, while "fault-free" means the network is free of hardware failures, such as broken links or slow nodes. Network load balancing evenly distributes traffic across available paths, reducing conflicts and enhancing flow efficiency.
Supercomputing traffic is highly synchronized, high-volume, and cyclical, meaning all equivalent paths are occupied simultaneously. Traditional ECMP (Equal-Cost Multi-Path) load balancing, which uses hash-based methods, often fails to achieve ideal distribution in high-traffic environments. This can lead to congestion in some paths while others remain underutilized. For example, with 8 data flows distributed across 8 paths, random hashing may cluster multiple flows on certain paths, creating bottlenecks and idle links.
Network Load Balancing Technology
As shown in the figure above, network-level load balancing can preemptively distribute network traffic, balancing all paths without congestion or packet loss. Network devices collect traffic data and forward it to the network controller. The controller processes this data with the network topology and runs a global routing algorithm to assign optimal paths for each flow. The controller sends the routing information back to network devices, enabling precise adjustments and ensuring congestion-free performance.
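The 8-flows-on-8-paths example can be made concrete. The sketch below contrasts hash-style random placement with a controller that sees all flows at once; it is a toy model, not an actual ECMP or controller implementation.

```python
import random

def ecmp_hash(flows, n_paths, seed=0):
    # Hash-style placement: each flow lands on a pseudo-random path,
    # so collisions can congest some paths while others sit idle.
    rng = random.Random(seed)
    load = [0] * n_paths
    for _ in flows:
        load[rng.randrange(n_paths)] += 1
    return load

def controller_assign(flows, n_paths):
    # Global view: the controller spreads flows evenly, one per path,
    # so no path carries more than its fair share.
    load = [0] * n_paths
    for i, _ in enumerate(flows):
        load[i % n_paths] += 1
    return load

print(controller_assign(range(8), 8))  # [1, 1, 1, 1, 1, 1, 1, 1]
```

With hashing, the maximum per-path load is frequently above 1 for 8 flows on 8 paths, which is exactly the bottleneck-plus-idle-link pattern described above.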
Priority Flow Control Technology
Priority Flow Control (PFC) includes two approaches: PFC 1.0, applicable within switch networks, and PFC 2.0, an enhanced solution coordinating switches and routers across multiple AI computing centers.
PFC 1.0: For Switch Networks
PFC 1.0 aims to prevent the performance degradation that packet loss causes for AI workloads. While network load balancing mitigates congestion and packet loss under normal conditions, certain unexpected scenarios, such as optical transceiver faults, packet errors on long-distance links, and server-side congestion, pose challenges that standard load balancing cannot resolve, thereby reducing training performance.
When network faults occur, whether link-related or caused by reduced server-side reception rates, network throughput drops and some congestion becomes unavoidable. Within a data center, congestion can be mitigated quickly by standard flow control mechanisms thanks to short feedback times. On long-distance links, however, extended feedback times and limited buffer capacities can lead to packet loss.
Switch-based priority flow control diverts unavoidable congestion from long-distance links to the first network device hop. Network devices monitor queue build-up and port backpressure to detect congestion. When congestion arises and affects downstream nodes, devices notify the first-hop (source Leaf) switch to initiate flow regulation. Based on congestion intensity, the Leaf switch throttles affected traffic by sending PFC, CNP, or other control protocol packets, as shown in the figure below.
PFC for Switches
In large AI model training, data flows exhibit a cyclical congestion pattern: if a flow experiences congestion in one cycle—due to link faults or limited destination capacity—this congestion is likely to recur in subsequent cycles. Based on this characteristic, the source Leaf switch maintains a table to record which flows experience congestion. This allows for immediate rate-limiting of recurring congested flows, avoiding delays caused by distant congestion notifications. Thus, with the initial cycle’s congestion data, lossless traffic flow can be achieved in subsequent cycles.
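The cyclical-congestion table can be sketched as a small stateful structure at the source Leaf. The class name and the 50% throttle factor below are illustrative assumptions; real switches implement this in hardware with PFC/CNP signaling rather than in software.

```python
class CongestionTable:
    """Toy sketch: remember flows that were congested in one training
    cycle and pre-throttle them in later cycles, without waiting for a
    distant congestion notification to arrive again."""

    def __init__(self, throttle_factor=0.5):
        self.throttle_factor = throttle_factor  # assumed rate reduction
        self.congested = set()

    def on_congestion_notification(self, flow_id):
        # A PFC/CNP-style notification marks the flow as a known offender.
        self.congested.add(flow_id)

    def rate_for(self, flow_id, line_rate_gbps):
        # Recurring offenders are rate-limited immediately at the first hop.
        if flow_id in self.congested:
            return line_rate_gbps * self.throttle_factor
        return line_rate_gbps
```

Because supercomputing traffic is cyclical, the table filled during the first congested cycle is enough to keep subsequent cycles lossless.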
PFC 2.0: Enhanced for Multi-Center AI Systems
During collaborative training across supercomputing data centers, congestion or faults may occur on any network node or link, with inter-center distances amplifying feedback delays. Switches and routers coordinate under PFC 2.0 to tackle sudden congestion while maintaining performance during extended faults. Flow-level backpressure mitigates congestion propagation, significantly improving throughput.
Compared to conventional PFC, router-based PFC addresses issues like head-of-line blocking, backpressure storms, and deadlocks by moving from port-level to flow-level controls. By using IP datagrams as flow identifiers, it independently monitors and adjusts flows to minimize congestion and fault impact.
In inter-data center scenarios, where environments are complex and variable, router-based PFC employs flow-based control and precise buffer scheduling for lossless long-distance data transfer, safeguarding transmission continuity. In long-term fault scenarios, router PFC excels, regulating flow with pinpoint accuracy to minimize packet loss and achieve near-optimal bandwidth utilization, even under throttle constraints.
In response to dynamic workloads within data centers, router-based PFC offers high flexibility and intelligence, dynamically adjusting flow control strategies based on real-time network conditions. This approach ensures independent rate control at the flow level and precise backpressure, effectively handling surges and isolating faults to stabilize network operations. Additionally, the elastic cascading rate-reduction mechanism enhances adaptability and network resilience to sudden traffic bursts.
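The difference between port-level and flow-level backpressure can be shown in a few lines. This toy comparison illustrates head-of-line blocking only; real router-based PFC identifies flows by their IP headers and applies graded rate reductions rather than a binary pause.

```python
def port_level_pause(flows, congested_flow):
    # Port-level PFC: pausing the whole port blocks every flow that
    # shares it, including innocent ones (head-of-line blocking).
    return {f: "paused" for f in flows}

def flow_level_pause(flows, congested_flow):
    # Flow-level PFC: only the congested flow, identified by its IP
    # datagram header, is throttled; the rest keep running.
    return {f: ("paused" if f == congested_flow else "running") for f in flows}
```

Moving the control granularity from the port to the flow is what removes head-of-line blocking, backpressure storms, and deadlocks in the multi-center case.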
Optical Transceiver Channel Resilience Technology
Network or optical transceiver failures between devices can disrupt training. Annual failure rates for 400G/200G optical transceivers are estimated at roughly 4–6‰. A large-scale AI cluster can therefore experience around 60 transceiver failures per year, or about one every six days on average. Large-model training relies on periodic checkpoints to resume from the most recent point after a failure, so frequent network disruptions force inefficient rollbacks and reduce overall productivity.
Reportedly, over 90% of optical transceiver failures are caused by laser faults. Short-range 200GE/400GE SR transceivers use four parallel channels, so a single laser failure can take down the entire link. Channel resilience technology mitigates this by deactivating the failed lane and keeping the link running on the remaining lanes at reduced bandwidth, maintaining training continuity.
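The effect of lane-level resilience can be expressed as simple arithmetic. The function below is an illustrative model assuming a 4-lane module with a known per-lane rate; actual modules negotiate the degraded mode with the switch rather than computing it in software.

```python
def degraded_link_rate(lane_rate_gbps, total_lanes, failed_lanes):
    # Without resilience, one failed laser takes the whole link down.
    # With lane degrade, the link keeps running on the surviving lanes.
    active = total_lanes - failed_lanes
    return max(active, 0) * lane_rate_gbps

# A 4 x 100G 400GE SR4-style module losing one laser keeps 300G of capacity.
remaining = degraded_link_rate(100, 4, 1)
```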
In addition to resilient design, selecting high-performance optical transceivers with a low BER is crucial to minimize link downtime and boost system reliability. NADDOD offers high-performance transceivers and cables for InfiniBand and RoCE networking, optimized for AI/ML workloads.
Flow Monitoring and Packet Loss Detection Technology
Packet loss in Remote Direct Memory Access over Converged Ethernet (RoCE) networks can significantly impact training performance. Ensuring communication quality in both local and long-distance links requires comprehensive flow sampling and real-time RoCE traffic monitoring. Administrators must detect packet loss promptly, including precise location, volume, and timing, to assess network impact and resolve issues efficiently.
Key features of flow monitoring and packet loss detection:
- Rapid Fault Localization: Real-time monitoring detects latency and packet loss instantly.
- Flow Path Visualization: Enables centralized network management.
RoCE Application Scenario
In distributed AI compute centers, long-distance interconnects require designated ingress, transit, and egress points to manage flow statistics. Compute access leaf switches serve as Ingress and Egress points, while Spine and DCI leaf switches act as Transit nodes.
- Ingress: Marks entry points, tags flow characteristics, and reports data to an analyzer.
- Transit: Tracks data tagged at the ingress and relays it to an analyzer.
- Egress: Measures exit flow and removes tags, feeding the results to an analyzer.
Packet loss and latency metrics also measure flow reliability:
- Packet Loss: Calculated as the difference between incoming and outgoing flow volume within a monitoring period.
- Latency: Measures time from entry to exit across two nodes.
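The two metrics above reduce to simple counter and timestamp arithmetic on the tagged flows. The helpers below are a minimal sketch of that bookkeeping, assuming the Ingress and Egress nodes keep synchronized clocks (for example via PTP).

```python
def packet_loss(ingress_packets, egress_packets):
    # Loss in a monitoring period = packets tagged at the Ingress node
    # minus packets observed leaving at the Egress node.
    return ingress_packets - egress_packets

def one_way_latency(ingress_ts, egress_ts):
    # Per-packet latency between the two measurement points; only
    # meaningful when the nodes' clocks are synchronized.
    return egress_ts - ingress_ts
```

A nonzero loss count for a flow immediately localizes the problem to the segment between the reporting Transit nodes, which is what enables the rapid fault localization described above.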
High-Bandwidth Transmission Technology
Raising single-port rates is essential for efficient, cost-effective data transfer in AI interconnect networks. Today, 800Gbps mid-range port technology is mature and deployed in AI compute networks for connections up to 100 km. These connections support high-throughput AI tasks while reducing overall costs. Future developments target 1.2Tbps port rates to lower per-bit transmission costs.
Continuous Improvement in Single-Wavelength Rate
As transmission technology progresses from single-wavelength 400Gbps to 800Gbps and up to 1.2Tbps, the spectral width used per signal also increases. To maximize single-fiber capacity, the industry is extending from the traditional C-band to the L-band, expanding the spectrum into a combined C+L band that can support up to 96Tbps. This upgrade significantly boosts transmission capabilities to handle the high data demands between AI compute centers.
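The 96 Tbps figure is consistent with a quick back-of-envelope check. The spectrum width and channel grid below are assumed round numbers chosen for illustration, not a specific product's parameters.

```python
# Assumed figures: ~12 THz of combined C+L spectrum, 100 GHz channel
# spacing, 800 Gbps per wavelength.
spectrum_ghz = 12_000
grid_ghz = 100
per_wavelength_gbps = 800

wavelengths = spectrum_ghz // grid_ghz               # 120 channels
capacity_tbps = wavelengths * per_wavelength_gbps / 1000
print(capacity_tbps)  # 96.0, matching the ~96 Tbps cited above
```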
Wavelength-Level Dynamic Reconfiguration Technology
In distributed compute centers, resources are often leased on a time-sharing basis. Thus, flexible wavelength-level interconnections between compute centers based on available GPUs are essential, with bandwidth dynamically adjusted based on factors like latency and distance. To support these needs, optical transport networks (OTNs) must feature wavelength-level reconfigurability (also known as electrical-optical technology).
Two main application scenarios for electrical-optical technology are:
- Wavelength-Level Reconfiguration: Dynamically adjust the wavelength layer.
- ODU-Level Reconfiguration: Adjust both electrical cross-connect and wavelength layers dynamically.
Based on service demands (e.g., source/destination points, routing policies, protection levels), electrical-optical technology provides the following capabilities:
- Cross-Layer Routing Coordination: Automatically calculate optimal OCH paths that meet business constraints like latency and route separation.
- Optical-Electrical Cross-Link Creation: Automatically generate configuration parameters such as Client-to-OCH mappings, wavelength frequencies, fiber mappings, and relay ports.
- Automated Testing and Tuning: Perform cross-layer routing and automated testing to ensure optimal performance.
High-Performance Wavelength Switched Optical Network (WSON) Technology
Traditional WSON rerouting times can last from seconds to minutes, risking interruptions in large-scale computing. Improved WSON capabilities are essential for fast, deterministic optical-layer recovery. In modern OTNs, electrical-layer SNCP protection provides 50ms protection but requires extensive resources. To optimize for AI workloads, WSON’s 50ms protection reduces resource use while maintaining reliability.
Key components of WSON 50ms technology include:
- Control-Data Plane Separation: Decouples path calculation, resource allocation, and path establishment to ensure only essential rerouting tasks are performed, independent of network size or load.
- Shared Resource Routing Algorithm: Optimizes network resources globally to enable shared, conflict-free resource recovery.
- High-Speed Packet Forwarding: Uses specialized chips for fast forwarding, reducing dependence on CPU and route hops.
- WSS Fast Switching: Achieves millisecond-level wavelength switching via LCOS technology, enabling multi-failure recovery within 50ms.
Alarm Compression and Root Cause Identification Technology
In AI model training, fault recovery within 10 minutes is critical to prevent extended disruptions. As OTN networks grow, managing network elements under a unified NMS introduces complexities that challenge traditional fault-handling approaches. Issues like alarm overload, root cause complexity, and time-consuming localization impact service continuity, necessitating intelligent O&M.
To streamline fault management, two core solutions are employed:
Intelligent Fault Detection and Identification
Network elements use integrated modules to analyze and report alarm relationships, generating real-time fault propagation maps. These maps synthesize alarm flows, network topology, and protection configurations to isolate root causes promptly.
Advanced Margin Performance Evaluation
To ensure stable add/drop operations, digital modeling predicts the operational feasibility of each wavelength, preventing service interruptions. Quality of Transmission (QoT) models and algorithms assess OSNR margin changes, aiding in fault boundary localization and evaluating system capacity for real-time fault detection.
Through intelligent reasoning, alarm compression reduces the volume of alarms, accelerating fault localization. Precise margin assessment also provides advance OSNR evaluation for operational adjustments, lowering the risk of service impact.