Resources

Single-Phase vs. Two-Phase Immersion Cooling in Data Centers

Discover the two primary types of immersion cooling, their mechanisms, benefits, and ideal applications for high-efficiency data centers.
Jason
Nov 14, 2024
Nine Technologies for Lossless Networking in Distributed Supercomputing Data Centers

What technologies are essential for building a lossless, scalable, high-performance network in distributed AI and supercomputing data centers?
Jason
Nov 13, 2024
Introduction to NVIDIA DGX H100/H200 System

This article details the components and features of the NVIDIA DGX H100/H200 system.
Adam
Nov 7, 2024
Meta Trains Llama 4 on a 100,000+ H100 GPU Supercluster

Meta is setting a new standard in AI with Llama 4, trained on a supercluster of 100,000+ NVIDIA H100 GPUs. Announced by CEO Mark Zuckerberg, this initiative rivals AI efforts by Microsoft, Google, and xAI. Discover how Meta’s open-source strategy and major infrastructure investments aim to reshape AI’s future and drive new revenue.
Jason
Nov 5, 2024
Comparing NVIDIA’s Top AI GPUs H100, A100, A6000, and L40S

Choosing the right GPU is key to optimizing AI model training and inference. NVIDIA’s H100, A100, A6000, and L40S each have unique strengths, from high-capacity training to efficient inference. This article compares their performance and applications, showcasing real-world examples where top companies use these GPUs to power advanced AI projects.
Jason
Nov 1, 2024
Inside xAI Colossus, the 100,000-GPU Supercluster Powered by NVIDIA Spectrum-X

Elon Musk's xAI Colossus supercomputer, built with 100,000 NVIDIA H100 GPUs and powered by NVIDIA’s Spectrum-X platform, is designed to support advanced AI model training. Completed in just 122 days, Colossus uses high-speed Ethernet networking and liquid cooling for efficiency and scalability, with plans to double its capacity in the future.
Jason
Oct 29, 2024
InfiniBand Simplified: Core Technology FAQs

Learn the essential details about InfiniBand technology with this FAQ guide. Explore topics like hardware compatibility, best practices for connecting InfiniBand devices, and how to optimize your network for AI and HPC environments.
Quinn
Oct 24, 2024
Spectrum-X: NVIDIA's Answer to AI Ethernet Challenges

In this article, we explore the limitations of traditional Ethernet in handling AI traffic, including load imbalance, high latency, and poor congestion control. We then examine how Spectrum-X uses technologies such as RoCE adaptive routing and NVIDIA Direct Data Placement to solve these challenges and improve performance isolation in AI cloud environments.
Jason
Oct 22, 2024
Why Do AI/ML Networks Rely on RDMA?

RDMA plays a vital role in AI/ML deployments by enabling faster data transfers, reducing latency, and minimizing CPU load. Its integration into InfiniBand and Ethernet networks supports efficient AI/ML infrastructures, powering applications like training, inference, and reinforcement learning while enhancing overall network performance.
Jason
Oct 18, 2024