Why Do AI/ML Networks Rely on RDMA?

Jason
Data Center Architect · Oct 18, 2024 · AI Networking

Remote Direct Memory Access (RDMA) enables two servers to directly read or write to each other’s memory without involving the CPU, cache, or operating system. By bypassing these components, RDMA significantly reduces CPU load, minimizes latency, and accelerates data transfers. This makes RDMA highly advantageous for a wide range of applications in networking, storage, and computing.
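
To make the mechanism concrete, here is a minimal sketch using libibverbs, the standard verbs programming interface for RDMA. It posts a one-sided RDMA write to a remote buffer; the function name, and the assumptions that a queue pair is already connected and that the remote address and rkey were exchanged out of band, are illustrative rather than taken from any particular codebase.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: post a one-sided RDMA WRITE on an already-connected
 * queue pair. remote_addr and rkey are assumed to have been exchanged out
 * of band (e.g., over TCP) during connection setup. */
int rdma_write_example(struct ibv_qp *qp, struct ibv_mr *mr,
                       void *local_buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* local registered buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,              /* key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr = {0};
    struct ibv_send_wr *bad_wr = NULL;

    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided operation */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion event */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The NIC moves the data directly into remote memory; the remote
     * host's CPU, cache, and OS are never involved. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The key point is the final call: once the work request is posted, the NICs complete the transfer on their own, which is exactly the CPU-bypass behavior described above.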


Historically, RDMA was primarily deployed in High-Performance Computing (HPC) environments, particularly in supercomputing projects, with limited adoption in cloud computing and enterprise data centers. However, this changed dramatically by late 2022, as AI/ML became a major focus of investment. As data center spending shifted rapidly toward AI/ML deployments, RDMA, which was originally designed for large-scale parallel computing in HPC clusters, emerged as a key enabler for AI/ML workloads due to its inherent ability to efficiently handle massively parallel processing tasks.


Figure: RDMA mode vs. traditional mode

The scale and speed of this transition have been unprecedented. By the end of 2023, RDMA-based network deployments had surpassed the combined totals of 2021 and 2022, signaling a rapid shift toward the technology. This surge in adoption positions RDMA as a critical component in the growth of AI/ML infrastructure. According to 650 Group, the RDMA networking market is projected to exceed $22 billion by 2028.

InfiniBand and RoCEv2 are the two primary implementations of RDMA. InfiniBand is a purpose-built RDMA fabric, while RoCEv2 delivers RDMA over standard Ethernet infrastructure. Early versions of RoCE required a lossless "Converged Ethernet" fabric, but modern iterations run on standard Ethernet. The industry is actively working to improve Ethernet congestion control, a critical factor in reducing packet loss and supporting RoCE in high-performance environments.
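
A practical consequence is that application code is largely transport-agnostic. The sketch below uses librdmacm, the RDMA connection manager, to resolve a destination address; whether the path underneath is InfiniBand or RoCEv2 is decided by the device the address maps to, not by the application. The address and port are placeholders.

```c
#include <rdma/rdma_cma.h>
#include <netdb.h>
#include <stdio.h>

/* Illustrative sketch: resolve an RDMA destination with librdmacm. The
 * same calls work over InfiniBand and RoCEv2; the CM picks the transport
 * from the address. 192.0.2.10 and port 7471 are placeholders. */
int main(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;
    struct addrinfo *res;

    if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP))
        return 1;
    if (getaddrinfo("192.0.2.10", "7471", NULL, &res))
        return 1;

    /* Kicks off asynchronous address resolution; completion arrives as
     * an RDMA_CM_EVENT_ADDR_RESOLVED event on the channel. */
    if (rdma_resolve_addr(id, NULL, res->ai_addr, 2000))
        return 1;
    if (rdma_get_cm_event(ec, &ev))
        return 1;
    if (ev->event == RDMA_CM_EVENT_ADDR_RESOLVED)
        printf("resolved via device %s\n", id->verbs->device->name);
    rdma_ack_cm_event(ev);

    /* From here the usual flow is rdma_resolve_route(), rdma_connect(),
     * then verbs post/poll calls on the connected queue pair. */
    freeaddrinfo(res);
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return 0;
}
```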


With more than 400 million Ethernet switch ports already installed in data centers globally, Ethernet is expected to play an increasingly prominent role in AI/ML networking. As a result, more RDMA operations will likely be performed over Ethernet in the near future.


Changes in the Server Market

As the shift from general-purpose servers to those designed specifically for AI/ML workloads accelerates, the server market is undergoing a major transformation. According to 650 Group, the number of AI/ML servers is projected to skyrocket from 1 million units in 2023 to over 6 million by 2028.


At the same time, the value of the AI/ML server market is expected to approach $300 billion. This rapid rise in AI/ML will double data center spending, with the majority of these 6 million servers expected to be equipped with backend networks or AI-specific architectures to connect compute nodes.

Figure: Server market changes (Source: 650 Group)

The number of GPUs and AI ASICs per server is also set to increase with each product generation. Today, servers equipped with 8 GPUs are common, but soon, servers with 16 or even 32 GPUs will become the norm. As the parameter size of AI models expands from billions to trillions, the memory capacity of individual GPUs will need to grow accordingly. In this context, RDMA plays a critical role by improving the efficiency of data transfers between servers, which is essential for scaling systems and achieving the ambitious goals of AI model training.
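
A rough calculation, using assumed round numbers rather than measurements, shows why a single server can no longer hold a frontier model and why inter-server transfer efficiency matters so much:

```c
#include <stdio.h>

/* Back-of-envelope sizing with assumed round numbers: weight storage for
 * a trillion-parameter model at 16-bit precision, compared with one
 * high-end GPU. Optimizer state and activations add several times more
 * memory in practice. */
int main(void)
{
    const double params     = 1e12;  /* one trillion parameters */
    const double bytes_each = 2.0;   /* FP16/BF16 weights */
    const double gpu_mem_gb = 80.0;  /* assumed per-GPU memory */

    double weights_gb = params * bytes_each / 1e9;
    printf("weights alone: %.0f GB -> at least %.0f GPUs just to hold them\n",
           weights_gb, weights_gb / gpu_mem_gb);
    return 0;
}
```

Under these assumptions, the weights alone occupy about 2,000 GB, or at least 25 GPUs before any optimizer state or activations are counted, so the model must be sharded across servers and every training step moves data over the network.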


Job Completion Time (JCT) and Performance Metrics

The ability to directly access memory on other servers greatly enhances overall AI model performance. With RDMA, data reaches the GPUs faster, reducing JCT and improving cluster-wide throughput.

In the early stages of AI/ML cluster development, a key challenge was that GPU cores often sat idle due to packet loss or delays in data delivery. Because training steps are typically synchronized across all GPUs, a delay on a single node could stall the entire cluster, leaving its computing resources underutilized. RDMA effectively resolves these network bottlenecks, optimizing JCT and improving performance across the board. While there may be minor performance differences between Ethernet and InfiniBand, RDMA still represents a significant improvement over traditional network technologies.
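
A back-of-envelope model, sketched below with assumed numbers rather than measurements, illustrates how even rare loss-induced stalls inflate JCT when every GPU must wait for the slowest participant:

```c
#include <stdio.h>

/* Illustrative JCT arithmetic with assumed numbers: a 100 ms training
 * step, one million steps, and a 200 ms retransmission stall hitting
 * 1% of steps on a lossy fabric. */
int main(void)
{
    const double step_ms   = 100.0;
    const long   steps     = 1000000;
    const double stall_ms  = 200.0;
    const double stall_pct = 0.01;

    double jct_lossless = steps * step_ms;
    double jct_lossy    = steps * (step_ms + stall_pct * stall_ms);

    printf("lossless JCT: %.1f hours\n", jct_lossless / 3.6e6);
    printf("lossy JCT:    %.1f hours (+%.0f%%)\n",
           jct_lossy / 3.6e6,
           100.0 * (jct_lossy - jct_lossless) / jct_lossless);
    return 0;
}
```

A 2% penalty may look small, but it is paid simultaneously by every GPU in the cluster, which is why a lossless, RDMA-capable fabric is treated as a requirement rather than an optimization.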


NIC Market Evolution

While all InfiniBand NICs support RDMA, not all Ethernet NICs are currently equipped to handle RDMA/RoCE. For traditional Ethernet NIC manufacturers aiming to compete in the AI/ML space, integrating RoCE functionality into their products is crucial. As NIC speeds increase to 400 Gbps and beyond, it is anticipated that most Ethernet NICs will support RoCE. The addition of advanced features and higher port speeds will also drive up the average selling price (ASP) of Ethernet NICs.
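
Whether a particular NIC exposes RDMA can be checked programmatically. The sketch below enumerates verbs-capable devices and reports each one's link layer, separating InfiniBand HCAs from RoCE-capable Ethernet NICs; it assumes port 1 on every device, and NICs with no RDMA support simply never appear in the list. The ibv_devinfo command-line tool reports the same information.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* List RDMA-capable devices and report whether each runs RDMA over
 * InfiniBand or over Ethernet (RoCE). Assumes port 1 on every device. */
int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list)
        return 1;

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_port_attr port;
        if (!ctx)
            continue;

        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: %s\n", ibv_get_device_name(list[i]),
                   port.link_layer == IBV_LINK_LAYER_ETHERNET
                       ? "RoCE (Ethernet)" : "InfiniBand");
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```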


The integration of RoCE capabilities into Ethernet NICs will vary based on factors such as the type of processor, offload engines, and the expertise of the R&D teams. This will lead to varying levels of performance across NIC suppliers. However, as new product generations emerge, vendors are expected to continually optimize RoCE performance, gradually narrowing the performance gap and improving interoperability between devices, giving users more flexibility and options.


AI/ML Backend Networks

Most AI/ML servers rely on backend networks, which operate independently from the primary data center network. These backend networks can be based on either InfiniBand or Ethernet, and their primary purpose is to interconnect servers within the AI/ML cluster. The main focus is on providing high-speed connections between GPUs and between GPUs and memory. Backend networks complement existing infrastructures by adding more ports to each server, significantly increasing the market potential for network equipment.


AI/ML applications often involve multiple backend networks, each designed for specific tasks. For example, while RDMA can run over both Ethernet and InfiniBand, some GPU or AI ASIC providers may opt for other network types to create higher-performance solutions.


RDMA Market Evolution and Growth

According to 650 Group, before 2021, the RDMA market was driven by HPC, generating annual revenues between $400 million and $700 million. However, the rise in demand for AI/ML deployments has led to a rapid increase in RDMA adoption. By 2023, the market had expanded to over $6 billion, with projections to surpass $22 billion by 2028. This growth is expected to accelerate as operators continue to boost CAPEX spending on AI/ML infrastructure, further increasing RDMA-related project budgets in the years ahead.


Figure: RDMA market forecast and revenue (Source: 650 Group)

The RDMA market can be divided into two primary categories.


  • Technology-based: Today, RDMA is predominantly deployed with InfiniBand, but RDMA over Ethernet is expected to gain more traction as adoption increases.
  • Hardware-focused: The market primarily revolves around NICs and switches. In InfiniBand environments, NICs and switches are typically procured together, while in Ethernet environments, these components are often purchased by separate teams on different procurement cycles. This divide between server and network teams continues to exist in AI/ML deployments, despite the growing convergence within AI/ML architectures.


As a result, it is anticipated that customers will develop distinct preferences for NIC and switch suppliers in their AI/ML networks, which may differ from those used in more traditional compute environments.


Final Thoughts

RDMA and RoCE are essential technologies for AI/ML networks. Without them, the rapid scaling needed to meet the growing demands of AI/ML infrastructure simply wouldn’t be possible. As the server market continues to shift from traditional computing to AI/ML, RDMA and RoCE are set to see significant growth, with their adoption expected to accelerate in the coming years.


While there will be differences in preferences for technologies and vendors, RDMA is set to experience widespread adoption and rapid growth. Ethernet and InfiniBand will coexist, and rather than choosing one over the other, most customers will deploy RDMA across both networks. It’s rare for networks to rely on a single type or supplier, and RDMA’s flexibility ensures it will play a key role in mixed environments.


Given the diverse range of AI/ML applications—from basic training and reinforcement learning to inference—ensuring RDMA works seamlessly across different platforms is crucial. This allows customers to focus on optimizing AI/ML workloads instead of worrying about underlying network issues.


For AI/ML networks that depend on RDMA, NADDOD offers full support, whether through InfiniBand or RoCEv2 solutions. NADDOD provides high-speed InfiniBand and RoCE transceivers, DACs, AOCs, and 51.2T AI data center switches for Ethernet networks, ensuring the performance needed for AI cloud and AI factory deployments.



Source: 650 Group
