Elon Musk has powered on xAI's new supercomputer in Memphis, dubbed the "Memphis Supercluster," which features 100,000 liquid-cooled NVIDIA H100 AI GPUs valued at approximately $4 billion. Musk touts the setup as the "most powerful AI training cluster in the world." The GPUs are connected over a single RDMA (Remote Direct Memory Access) fabric, which lets one machine read and write another machine's memory directly, without involving the remote operating system. This delivers the high throughput and low latency that large-scale parallel training clusters require.
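To illustrate what training traffic over such a fabric looks like in practice, here is a minimal sketch using PyTorch's distributed API with the NCCL backend, which can use RDMA transports such as InfiniBand and GPUDirect RDMA when the cluster provides them. The launch configuration, buffer size, and node layout are illustrative assumptions, not details of xAI's actual deployment.

```python
# Minimal sketch: a gradient all-reduce over an RDMA-capable fabric.
# Assumes PyTorch with the NCCL backend; NCCL picks RDMA transports
# (e.g., InfiniBand / GPUDirect RDMA) when the hardware supports them.
# Illustrative launch (not xAI's configuration):
#   torchrun --nproc_per_node=8 --nnodes=2 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 \
#            allreduce_sketch.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A stand-in for one shard of gradients produced by a training step.
    grads = torch.randn(1024 * 1024, device="cuda")

    # NCCL moves this buffer GPU-to-GPU; on an RDMA fabric the transfer
    # bypasses the remote host's OS and CPU entirely.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```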
xAI Builds Massive In-House GPU Supercomputer for GROK 3
Initially, xAI rented NVIDIA's AI chips from Oracle, but it later opted to build its own servers and terminated that agreement. The current project aims to surpass what Oracle could offer by deploying 100,000 high-performance H100 GPUs. The NVIDIA H100 is designed specifically for training AI models, which demands substantial energy and computational power. In the third quarter of 2023 alone, NVIDIA sold 500,000 H100 GPUs, and earlier this year Meta CEO Mark Zuckerberg announced that his company would purchase 350,000 H100s for AI training over the course of the year.
Each H100 GPU costs about $30,000. GROK 2, which finished training last month and is set for release next month, used 20,000 of these GPUs and is comparable to GPT-4. GROK 3, expected to launch in December, will require five times that compute, which xAI hopes will position it as the world's most powerful AI.
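A quick back-of-envelope check, using only the figures cited in this article, shows how the "five times" claim lines up with the new cluster's size:

```python
# Back-of-envelope check using the figures cited in this article.
grok2_gpus = 20_000       # H100s reportedly used to train GROK 2
scale_factor = 5          # GROK 3 said to need ~5x the compute
price_per_h100 = 30_000   # approximate per-unit cost in USD

grok3_gpus = grok2_gpus * scale_factor
print(f"Implied GROK 3 GPU count: {grok3_gpus:,}")                        # 100,000
print(f"GPU hardware alone: ${grok3_gpus * price_per_h100 / 1e9:.1f}B")   # $3.0B
```

The implied 100,000-GPU count matches the Memphis cluster exactly; the ~$3 billion GPU bill sits below the roughly $4 billion valuation quoted earlier, which presumably covers more than the GPUs alone.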
This decision is surprising, especially since NVIDIA is set to launch its newer H200 GPUs in Q3. The H200, which entered mass production in Q2, builds on the Hopper architecture with an upgraded memory configuration and reportedly cuts response times for generative AI outputs by 45%.
xAI's Supercomputer Investment Surpasses Industry Giants
From a scale perspective, the xAI Memphis supercomputer now ranks as the most powerful in the world by GPU count, far surpassing the 25,000 A100 GPUs OpenAI used to train GPT-4, as well as Aurora (60,000 Intel GPUs) and Microsoft's Eagle (14,400 NVIDIA H100 GPUs). It even exceeds the previous record-holder, Frontier, with its 37,888 AMD GPUs.
Previously, Musk's xAI had struggled to gain traction, and its Grok chatbot often drew criticism over usability. Given the current landscape, however, large-model training has become a game of raw power, and Musk is determined not to wait, choosing instead to invest heavily in resources.
In terms of computational power, xAI's setup boasts approximately 20 times the capacity of the 25,000 A100 GPUs used by OpenAI for training GPT-4.
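As a rough sanity check on the "20 times" claim, one can compare NVIDIA's published dense Tensor Core throughput figures for the two chips; the exact ratio depends on which precision each training run actually uses:

```python
# Rough sanity check of the ~20x claim using NVIDIA datasheet numbers
# (dense Tensor Core throughput, in TFLOPS). The realized ratio depends
# on the precision and efficiency each training run achieves.
a100_fp16 = 312    # A100 SXM, FP16/BF16 Tensor Core
h100_fp16 = 989    # H100 SXM, FP16/BF16 Tensor Core
h100_fp8 = 1979    # H100 SXM, FP8 Tensor Core

openai_gpus = 25_000
xai_gpus = 100_000

low = (xai_gpus * h100_fp16) / (openai_gpus * a100_fp16)    # ~12.7x
high = (xai_gpus * h100_fp8) / (openai_gpus * a100_fp16)    # ~25.4x
print(f"Peak-throughput ratio: {low:.1f}x (FP16) to {high:.1f}x (FP8)")
```

The article's roughly 20x figure sits within this band.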
Regarding energy consumption, the supercomputer requires a total power capacity of 70 MW, on the order of a small power plant's output and enough to meet the electricity demands of roughly 200,000 people.
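The 70 MW figure is consistent with the GPUs alone: at the H100 SXM's roughly 700 W TDP, 100,000 units draw 70 MW before counting CPUs, networking, and cooling overhead. A minimal sketch of the arithmetic, with the per-person figure derived from the article's 200,000-person comparison:

```python
# Sketch of where the 70 MW figure comes from.
gpu_count = 100_000
gpu_tdp_w = 700        # H100 SXM TDP, per NVIDIA specs
site_mw = 70           # total capacity cited in the article
people = 200_000       # population comparison cited above

gpu_draw_mw = gpu_count * gpu_tdp_w / 1e6
print(f"GPU draw alone: {gpu_draw_mw:.0f} MW of the {site_mw} MW site")  # 70 MW
print(f"Implied per-person demand: {site_mw * 1e6 / people:.0f} W")      # 350 W
```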
The AI arms race is intensifying, and speed is a critical factor: the fastest to launch can quickly capture market share. As a startup, xAI must make a strong impression in its competition against the other tech giants.
Large-Scale AI Training is Getting More Advanced and Costly
In addition to Musk, OpenAI and Microsoft are deploying even larger supercomputers. One such project, named "Stargate," aims to reach a million AI chips, with costs projected at $115 billion and a launch slated for 2028.
In April, OpenAI's deployment of a 100,000-H100 training cluster for GPT-6 reportedly caused a power outage at a Microsoft data center. It remains to be seen whether Musk will be the first to successfully run 100,000 H100 GPUs simultaneously.
Benefits of Liquid-cooled H100 GPUs
Liquid-cooled NVIDIA H100 GPUs offer several advantages for a cluster of this scale:
- Thermal performance: GPUs can sustain higher clock speeds for extended periods without throttling, improving overall processing efficiency.
- Power efficiency: liquid cooling reduces overall power consumption and heat generation, making it well suited to large-scale AI training clusters (see the sketch after this list).
- Density: more GPUs fit per rack unit, maximizing compute capacity within the data center.
- Noise: liquid cooling runs quieter than traditional air cooling systems.
- Reliability: maintaining optimal temperatures improves GPU lifespan and reduces the risk of heat-related failures.
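To make the efficiency point concrete, here is a hedged sketch: assuming an air-cooled facility runs at a PUE (power usage effectiveness) around 1.5 and a liquid-cooled one nearer 1.1, both illustrative values rather than measurements from the Memphis site, the facility-level savings on a 70 MW IT load are substantial:

```python
# Illustrative PUE comparison; the PUE values are assumptions,
# not measurements from the Memphis facility.
it_load_mw = 70     # IT (compute) load from the section above
pue_air = 1.5       # typical figure for an air-cooled data center
pue_liquid = 1.1    # typical figure for direct liquid cooling

facility_air = it_load_mw * pue_air
facility_liquid = it_load_mw * pue_liquid
print(f"Air-cooled facility draw:    {facility_air:.0f} MW")      # 105 MW
print(f"Liquid-cooled facility draw: {facility_liquid:.0f} MW")   # 77 MW
print(f"Savings: {facility_air - facility_liquid:.0f} MW")        # 28 MW
```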
NADDOD, as a leading provider of comprehensive data center connectivity solutions, offers a wide range of optical transceivers, DACs, AOCs, AECs, and ACCs to support the growing demands of AI training, high-performance computing, and supercomputing. For liquid-cooled data centers, NADDOD provides energy-efficient 800G liquid cooling optical transceivers to optimize power consumption and performance. Contact our networking experts to learn more about how NADDOD can help you build the next-generation data center infrastructure.