Last Wednesday, Meta CEO Mark Zuckerberg set a new benchmark for generative AI training, announcing during Meta’s earnings call that the company is training its Llama 4 model "on a cluster that is bigger than 100,000 H100 AI GPUs, or bigger than anything that I've seen reported for what others are doing."
(Image credit: CNET/YouTube)
Speaking to investors and analysts, Zuckerberg noted that Llama 4’s development is progressing well, with a projected release in early 2025. Although he refrained from disclosing specific details about Llama 4’s potential features, he hinted at improvements including “new modalities,” “stronger reasoning,” and “much faster” performance. This is a crucial step for Meta as it competes with other tech giants, such as Microsoft, Google, and Elon Musk’s xAI, in the race to develop the next generation of large language models.
Meta is not the first company to build a cluster of 100,000 NVIDIA H100 GPUs for AI training. In late July, Elon Musk’s xAI brought a similar-scale cluster, Colossus, online to train its Grok models. Dubbed a “computational superfactory,” Colossus is set to double in size, with an additional 50,000 NVIDIA H100 and 50,000 H200 GPUs planned. Meta, however, announced earlier this year that it aims to have over 500,000 H100-equivalent GPUs by the end of 2024, suggesting that a significant portion of its AI GPUs may already be devoted to Llama 4’s training.
Unlike OpenAI’s GPT-4o and Google’s Gemini, which are accessible only through APIs, Meta takes a different approach with its AI, releasing its Llama models openly and free of charge. This open-access model allows other researchers, companies, and organizations to build on Llama, although Meta imposes licensing restrictions on certain commercial uses and does not disclose specifics about its training methods.
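That openness is what makes Llama easy to experiment with directly. As a minimal sketch, the snippet below loads an openly released Llama checkpoint with the Hugging Face `transformers` library; it assumes you have accepted Meta’s license for the gated `meta-llama` repositories and are authenticated (for example via `huggingface-cli login`), and the specific checkpoint name is just an illustrative choice (Llama 4 itself is not yet released):

```python
# Minimal sketch: loading an openly released Llama checkpoint with
# Hugging Face transformers. Requires acceptance of Meta's license on
# the gated "meta-llama" Hub repos and an authenticated session.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example checkpoint, not Llama 4

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" (needs the `accelerate` package) places weights on available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Open-weight models let developers", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```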
With such computational power comes an immense energy demand. A cluster of 100,000 H100 GPUs requires around 150 megawatts of power, far more than El Capitan, the largest U.S. national lab supercomputer, which draws approximately 30 megawatts. Meta expects to invest up to $40 billion this year in data centers and infrastructure, marking a 42% increase from 2023, and spending is projected to climb even further next year.
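As a rough sanity check on that 150-megawatt figure, here is a back-of-the-envelope estimate. Both numbers in it are assumptions, not disclosed figures: a 700 W TDP per H100 SXM GPU and an illustrative overhead multiplier covering host CPUs, networking, storage, and cooling:

```python
# Back-of-the-envelope cluster power estimate (illustrative assumptions only).
NUM_GPUS = 100_000
GPU_TDP_W = 700          # assumed H100 SXM TDP; PCIe variants draw less
OVERHEAD_FACTOR = 2.1    # assumed multiplier for servers, fabric, cooling, PUE

gpu_power_mw = NUM_GPUS * GPU_TDP_W / 1e6        # 70 MW for the GPU silicon alone
total_power_mw = gpu_power_mw * OVERHEAD_FACTOR  # ~147 MW, near the ~150 MW cited

print(f"GPUs alone: {gpu_power_mw:.0f} MW, full cluster: ~{total_power_mw:.0f} MW")
```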
Despite this massive investment in AI, Meta’s overall operating costs increased by just 9% this year, while sales (primarily from ads) grew by over 22%. This indicates that Meta’s heavy investment in the Llama project has not hindered its profitability. Meanwhile, OpenAI, often regarded as the leader in advanced AI development, continues to operate at a loss despite charging developers for access to its models. The company has confirmed that it is training GPT-5, the anticipated successor to the model currently powering ChatGPT. While GPT-5 will be larger than its predecessor, details about the cluster used for training have not been disclosed. OpenAI CEO Sam Altman has hinted at significant improvements and additional innovations in GPT-5, including a recently developed reasoning technique. Last Tuesday, Google CEO Sundar Pichai also shared that the company is developing the newest version of its Gemini family of generative AI models.
Meta’s open approach to AI has sparked controversy at times. Some AI experts worry that freely releasing powerful AI models could be dangerous, as they might enable bad actors to launch cyberattacks or aid in developing chemical or biological weapons. Although Llama is fine-tuned to restrict misuse, these restrictions are relatively easy to bypass. Unlike Google and OpenAI, which promote proprietary systems, Zuckerberg remains optimistic about the open-source strategy. “It seems pretty clear to me that open source will be the most cost effective, customizable, trustworthy, performant, and easiest to use option that is available to developers,” he said last Wednesday, adding that he is proud of Llama’s leadership in this space.
Zuckerberg also indicated that the new features of Llama 4 are expected to enhance Meta’s services even further. The signature product based on the Llama model is Meta AI, a ChatGPT-like chatbot integrated into Facebook, Instagram, WhatsApp, and other Meta platforms. More than 500 million users engage with Meta AI each month. Over time, Meta plans to monetize the feature through ad revenue. Meta CFO Susan Li remarked on Wednesday’s call that “there will be a broadening set of queries that people use it for, and the monetization opportunities will exist over time as we get there.” With that advertising potential, Meta may be able to subsidize Llama for wider use.
As AI/ML workloads and LLM training demands continue to grow, the stability and efficiency of AI clusters are more critical than ever. Training at this scale requires rapid data exchange across tens of thousands of GPUs, which in turn demands lossless, low-latency, congestion-free connectivity. While choosing high-performance AI GPUs is essential, robust and scalable data center networking forms the backbone of reliable performance.
NADDOD, a leader in AI networking solutions, addresses these needs with advanced technologies like InfiniBand and RoCE RDMA networking, immersion liquid cooling, and data center interconnect (DCI) solutions. NADDOD’s specialized network architects, professional support services, and rapid delivery times enable businesses to achieve optimal AI model training and inference workflows. With these solutions, companies can build dependable, high-performance AI infrastructure that meets the demands of modern AI and ML applications.
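To make the networking requirement concrete, here is a minimal sketch of multi-node data-parallel training in PyTorch. With the NCCL backend, the gradient all-reduce traffic in this loop is exactly the kind of collective communication that rides an InfiniBand or RoCE RDMA fabric; the model and hyperparameters are placeholders, not anything from an actual Llama training setup:

```python
# Minimal sketch: distributed data-parallel training whose gradient
# all-reduce traffic rides the cluster fabric (InfiniBand/RoCE) via NCCL.
# Launch with: torchrun --nnodes=<N> --nproc_per_node=8 train_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL uses RDMA transports when available
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real LLM would be sharded with FSDP or tensor parallelism.
    model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 4096, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across every GPU here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice, operators steer this NCCL traffic onto the RDMA fabric with environment variables such as `NCCL_IB_HCA` and `NCCL_SOCKET_IFNAME`, and it is at this layer that the quality of the transceivers, cabling, and switching fabric directly determines training throughput.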