Data Centers · Report

Elon Musk's xAI is doubling the Colossus supercluster to 200,000 Nvidia Hopper GPUs, with future plans to reach 300,000.

The massive private GPU cluster buildout demonstrates capital velocity in hyperscale infrastructure and increases competition for Nvidia's GPU supply allocation.

Trade press · Slicast · October 29, 2024 · Global · Source: tomshardware.com


Billionaire Elon Musk announced on X that his xAI data center is set to double its firepower "soon." The announcement follows a recent video by TechTuber ServeTheHome showcasing the xAI Colossus AI supercomputer, which features gleaming rows of Supermicro servers packed with 100,000 state-of-the-art Nvidia enterprise GPUs. According to Musk, Colossus is on course to "soon become a 200k H100/H200 training cluster in a single building." The 100,000-GPU incarnation started AI training only about two weeks ago, and Musk's record of missed timelines (including Tesla's full self-driving, Hyperloop delays, and SolarCity struggles) suggests caution regarding his forward-looking claims.

The xAI Colossus has been dubbed an engineering marvel, with praise extending beyond Musk's usual circle. Nvidia CEO Jensen Huang described the supercomputer project as a "superhuman" feat that had "never been done before." xAI engineers set up the supercomputer in 19 days, whereas projects of this scale and complexity typically take four years.

The 200,000 power-hungry H100/H200 GPUs will likely be used to train AI models and chatbots like Grok 3. This expansion is far from the hardware endgame for xAI Colossus, as Musk previously touted a Colossus with 300,000 Nvidia H200 GPUs. At the current pace, Musk could tweet about reaching this 300,000 goal before the end of 2024.

Reaching these ambitious targets involves several technical hurdles. Colossus 1's inefficient mixed-architecture design proved unsuitable for training Grok, so Anthropic is using it for inference instead. Meanwhile, SpaceX has rented access to its 220,000 Nvidia GPUs and 300 megawatts of AI compute power to rival Anthropic, and is unveiling an 11-million-square-foot Gigasat factory as a new manufacturing facility for space-based data centers. Delays could stem from GPU supply constraints and from infrastructure challenges, including the on-site power generation upgrades needed even for stage 1 and the complex liquid cooling and networking hardware the cluster requires.
