Compute & Cloud · Report

Meta declares operation of two new internal AI datacenters containing 24,000+ Nvidia H100 GPUs.

Quantifies hyperscaler captive infrastructure scale; validates massive H100 demand and high-bandwidth dataflow requirements.

Trade pressSlicast · March 14, 2024 · Global · Source: tweaktown.com

importance 80

Meta has announced an ambitious expansion of its AI infrastructure, planning to deploy 350,000 NVIDIA H100 GPUs by the end of 2024. As part of what the company describes as its "GenAI Infrastructure" roadmap, Meta unveiled two "24,576 GPU data center scale clusters" designed to support current and next-generation AI models, research, and development. This substantial investment reflects Meta's determination to lead in the competitive AI market, alongside comparable efforts by Microsoft, Google, and Amazon.

Each of the two clusters contains over 24,000 NVIDIA Tensor Core H100 GPUs configured with 400Gbps interconnect capability. The clusters employ different networking approaches: one uses Meta's own network fabric solution based on the Arista 7800, featuring Wedge400 and Minipack2 OCP rack switches, while the other deploys NVIDIA's more advanced Quantum2 fabric. This represents a significant upgrade from Meta's AI Research SuperCluster (RSC) from 2022, which housed 16,000 NVIDIA A100 GPUs.

Meta emphasized the performance benefits of its new infrastructure in a statement explaining that "the efficiency of the high-performance network fabrics within these clusters, some of the key storage decisions, combined with the 24,576 NVIDIA Tensor Core H100 GPUs in each, allow both cluster versions to support models larger and more complex than that could be supported in the RSC and pave the way for advancements in GenAI product development and AI research."

The company framed its hardware investments as foundational to its competitive strategy, stating: "To lead in developing AI means leading investments in hardware infrastructure." Meta articulated a broader vision centered on artificial general intelligence, declaring that "Meta's long-term vision is to build artificial general intelligence (AGI) that is open and built responsibly so that it can be widely available for everyone to benefit from."

Looking forward, Meta indicated a commitment to continuous infrastructure evolution. The company stated: "We are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. Our goal is to create systems that are flexible and reliable to support the fast-evolving new models and research."

Read the original