Chips & Hardware · Report

AMD's Instinct GPU line competes for share in hyperscale data center deployments against Nvidia's market-leading offerings.

A viable alternative to Nvidia intensifies competition in the GPU market and introduces price pressure in the AI accelerator segment.

Trade press · Slicast · December 23, 2024 · Global · Source: theregister.com


Nvidia dominated the AI infrastructure market in 2024: shipments of its Hopper GPUs more than tripled to over two million among its 12 largest customers, according to estimates from Omdia. However, the company faces intensifying competition from AMD's Instinct MI300 series. Omdia estimates that Microsoft, the largest cloud or hyperscale customer in the world, purchased approximately 581,000 GPUs in 2024, one in six of them built by AMD. At Meta, the most enthusiastic adopter of AMD's barely year-old accelerators, AMD accounted for 43 percent of GPU shipments, at 173,000 units versus Nvidia's 224,000. At Oracle, AMD accounted for 23 percent of the database giant's 163,000 GPU shipments. Across the four customers Omdia tracked (Microsoft, Meta, Oracle, and TensorWave), MI300X shipments totaled 327,000 units.
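The shipment figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not Omdia's methodology: it assumes "one in six" is an exact fraction and that Meta's GPU purchases consist only of AMD and Nvidia parts. The TensorWave figure is an inferred remainder and does not appear in the source.

```python
# Omdia estimates for 2024, as quoted in the text.
MICROSOFT_TOTAL_GPUS = 581_000
MICROSOFT_AMD_FRACTION = 1 / 6   # assumption: "one in six" taken as exact
META_AMD_GPUS = 173_000
META_NVIDIA_GPUS = 224_000
ORACLE_TOTAL_GPUS = 163_000
ORACLE_AMD_SHARE = 0.23
MI300X_TOTAL = 327_000           # Microsoft + Meta + Oracle + TensorWave


def share(part: float, whole: float) -> float:
    """Return part / whole as a fraction between 0 and 1."""
    return part / whole


# AMD units implied by the quoted totals and shares.
microsoft_amd = MICROSOFT_TOTAL_GPUS * MICROSOFT_AMD_FRACTION
oracle_amd = ORACLE_TOTAL_GPUS * ORACLE_AMD_SHARE

# Assumption: Meta bought only AMD and Nvidia GPUs.
meta_amd_share = share(META_AMD_GPUS, META_AMD_GPUS + META_NVIDIA_GPUS)

# Inferred remainder, not a figure stated in the source.
tensorwave_implied = MI300X_TOTAL - (microsoft_amd + META_AMD_GPUS + oracle_amd)

print(f"Microsoft AMD GPUs (implied): {microsoft_amd:,.0f}")
print(f"Oracle AMD GPUs (implied):    {oracle_amd:,.0f}")
print(f"Meta AMD share:               {meta_amd_share:.1%}")
print(f"TensorWave (remainder):       {tensorwave_implied:,.0f}")
```

Under these assumptions Microsoft's AMD purchases come to roughly 96,800 units and Oracle's to roughly 37,500, leaving a remainder of just under 20,000 for TensorWave. Meta's computed share of about 43.6 percent is in line with the quoted 43 percent.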

AMD's gains are particularly notable given that its MI300-series accelerators have been on the market for only a year. Before that, AMD's GPUs were used primarily in high-performance computing applications like Oak Ridge National Laboratory's 1.35 exaFLOPS Frontier supercomputer. "They managed to prove the effectiveness of the GPUs through the HPC scene last year, and I think that helped," Vladimir Galabov, research director for cloud and datacenter at Omdia, told The Register. "I do think there was a thirst for an Nvidia alternative." On paper, the MI300X offers substantial advantages over the H100: 1.3x higher floating point performance for AI workloads, 60 percent higher memory bandwidth, and 2.4x higher memory capacity. With 192 GB of HBM3 per GPU, or 1.5 TB of vRAM per server, large models like Meta's Llama 3.1 405B can run on a single node at full resolution, whereas a similarly equipped H100 server lacks the necessary memory. The MI300X provides 5.3 TBps of memory bandwidth, versus 3.3 TBps on the H100 and 4.8 TBps for the H200, making it theoretically capable of serving larger models faster. Even with Nvidia's Blackwell beginning to reach customers, AMD's MI325X maintains a capacity advantage at 256 GB per GPU, while its more powerful MI355X, slated for late 2025, will reach 288 GB.
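The single-node claim follows from simple capacity arithmetic. The sketch below is a rough illustration under stated assumptions, not a sizing tool: it assumes eight GPUs per server (implied by 1.5 TB divided by 192 GB), 80 GB per H100 (implied by the 2.4x capacity ratio), and that "full resolution" means 16-bit weights at 2 bytes per parameter. It counts weights only and ignores KV cache, activations, and runtime overhead, all of which add to real memory needs.

```python
GPUS_PER_SERVER = 8      # assumption: implied by 1.5 TB / 192 GB per GPU
MI300X_HBM_GB = 192      # from the text
H100_HBM_GB = 80         # implied by the text's 2.4x capacity ratio
MI300X_BW_TBPS = 5.3     # from the text
H100_BW_TBPS = 3.3       # from the text

LLAMA_PARAMS = 405e9     # Llama 3.1 405B
BYTES_PER_PARAM = 2      # assumption: "full resolution" = 16-bit weights


def weights_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def fits(required_gb: float, available_gb: float) -> bool:
    """True if the weights fit in the node's total GPU memory."""
    return required_gb <= available_gb


model_weights_gb = weights_gb(LLAMA_PARAMS, BYTES_PER_PARAM)
mi300x_node_gb = GPUS_PER_SERVER * MI300X_HBM_GB
h100_node_gb = GPUS_PER_SERVER * H100_HBM_GB
bandwidth_ratio = MI300X_BW_TBPS / H100_BW_TBPS

print(f"Weights at 16-bit:   {model_weights_gb:,.0f} GB")
print(f"MI300X node memory:  {mi300x_node_gb:,} GB, fits: "
      f"{fits(model_weights_gb, mi300x_node_gb)}")
print(f"H100 node memory:    {h100_node_gb:,} GB, fits: "
      f"{fits(model_weights_gb, h100_node_gb)}")
print(f"Bandwidth ratio:     {bandwidth_ratio:.2f}x")
```

With these assumptions the weights alone occupy 810 GB, which fits within the 1,536 GB of an eight-GPU MI300X server but exceeds the 640 GB of an eight-GPU H100 server. The bandwidth ratio of about 1.6 matches the quoted 60 percent advantage.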

Microsoft and Meta have gravitated toward AMD's accelerators for their deployments of large frontier models measuring hundreds of billions or even trillions of parameters. This momentum is reflected in AMD's guidance, which has improved quarter after quarter, with AMD now expecting Instinct to drive $5 billion in revenues in fiscal 2024. According to Galabov, "AMD executes well. It communicates well with clients, and it's good at talking about its strengths and its weaknesses transparently." Going forward, emerging GPU bit barns like CoreWeave represent potential growth drivers, and some are purposely building business models around Nvidia alternatives, with TensorWave a notable example.

Beyond AMD's challenge to Nvidia, cloud providers and hyperscalers are deploying substantial quantities of custom AI silicon. Omdia estimates that shipments of Meta's custom MTIA accelerators reached 1.5 million in 2024, while Amazon placed orders for 900,000 Inferentia chips and about 366,000 Trainium chips, the latter retuned for both training and inference workloads. Google ordered approximately one million TPU v5e and 480,000 TPU v5p accelerators, having used TPUs to train its proprietary Gemini and open Gemma language models. Inferentia and MTIA are designed for traditional machine learning tasks like recommender systems rather than large language models, but taken together these custom silicon deployments reflect the industry's broader movement toward diversified AI infrastructure.
