Data Centers · Report

AMD is being approached to build the world's fastest AI supercomputer powered by 1.2 million GPUs.

This roughly 30x increase in GPU count over the current fastest supercomputer signals a massive acceleration in AI infrastructure and validates AMD as a viable Nvidia alternative.

Trade press · Slicast · June 25, 2024 · Global · Source: tomshardware.com


Demand for computing power in data centers is growing at a staggering pace, according to AMD. The company has revealed that it has received serious inquiries about building single AI clusters with 1.2 million GPUs or more. The disclosure comes from a discussion The Next Platform had with Forrest Norrod, AMD's EVP and GM of the Datacenter Solutions Group, about the future of AMD in the data center. Asked directly whether the company has fielded inquiries for clusters as large as 1.2 million GPUs, Norrod confirmed the figure was virtually spot on, stating: "It's in that range? Yes" and emphasizing "I'm talking about one machine."

A cluster of 1.2 million GPUs would be extraordinary compared with existing infrastructure. Frontier, the fastest operational supercomputer, has "only" 37,888 GPUs. No supercomputer in operation today scales to millions of GPUs, so a cluster of this size would have roughly 30 times as many GPUs as Frontier.
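The "30 times" comparison can be checked with a line of arithmetic. The sketch below uses only the two GPU counts quoted above; the function and constant names are illustrative and do not come from AMD or the source article.

```python
# GPU counts as quoted in the article.
PROPOSED_GPUS = 1_200_000   # size of the cluster AMD says it was asked about
FRONTIER_GPUS = 37_888      # GPUs in Frontier

def scale_factor(proposed: int, existing: int) -> float:
    """Return how many times more GPUs `proposed` has than `existing`."""
    return proposed / existing

ratio = scale_factor(PROPOSED_GPUS, FRONTIER_GPUS)
print(f"{ratio:.1f}x Frontier's GPU count")  # prints "31.7x Frontier's GPU count"
```

The exact ratio is a little under 32, which is consistent with the article's rounded figure of about 30.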

Building an AI cluster of this magnitude presents profound technical challenges. AI workloads are extremely sensitive to latency, particularly tail latency and outliers, in which certain data transfers take much longer than others and stall the workload. Today's supercomputers must also mitigate GPU and other hardware failures that occur every few hours at current scales, a problem that would become far more pronounced at such unprecedented sizes. Beyond these operational concerns, the power required would be comparable to the output of a nuclear power plant.
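A rough back-of-envelope sketch shows why failures and power dominate at this scale. Every input marked ASSUMPTION below is illustrative and not taken from the source: "every few hours" is taken to mean 3 hours at Frontier's GPU count, failures are treated as independent with a constant per-GPU rate (so the interval between failures shrinks in proportion to GPU count), and each GPU is assumed to draw 750 W, with CPUs, networking, and cooling ignored.

```python
FRONTIER_GPUS = 37_888       # from the article
PROPOSED_GPUS = 1_200_000    # from the article

# ASSUMPTION: "every few hours" read as 3 hours; the article gives no exact figure.
ASSUMED_HOURS_BETWEEN_FAILURES = 3.0
# ASSUMPTION: illustrative per-GPU power draw in watts; not stated in the article.
ASSUMED_WATTS_PER_GPU = 750.0

def scaled_failure_interval_minutes(hours_at_base: float,
                                    base_gpus: int,
                                    target_gpus: int) -> float:
    """Mean minutes between failures at `target_gpus`, assuming independent
    failures at a constant per-GPU rate (interval scales as 1 / GPU count)."""
    return hours_at_base * 60.0 * base_gpus / target_gpus

def gpu_power_megawatts(gpus: int, watts_per_gpu: float) -> float:
    """Total GPU power in megawatts, excluding all other system components."""
    return gpus * watts_per_gpu / 1e6

interval = scaled_failure_interval_minutes(
    ASSUMED_HOURS_BETWEEN_FAILURES, FRONTIER_GPUS, PROPOSED_GPUS)
power = gpu_power_megawatts(PROPOSED_GPUS, ASSUMED_WATTS_PER_GPU)

print(f"~{interval:.1f} minutes between failures")
print(f"~{power:.0f} MW for the GPUs alone")
```

Under these assumptions a failure would occur about every 6 minutes, and the GPUs alone would draw about 900 MW, the same order of magnitude as the roughly 1 GW output of a large nuclear reactor. Different assumed inputs would shift both numbers, but not the overall picture.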

Despite these obstacles, the exploration of million-GPU clusters reflects the intensity of the AI race shaping the 2020s. Norrod did not identify which organization is considering a system of this scale, but noted that "very sober people" are contemplating spending tens to hundreds of billions of dollars on AI training clusters, which explains why clusters with millions of GPUs are being considered at all. If such scale is achievable, competitive pressure suggests someone will attempt it in pursuit of greater AI processing power.
