Tesla purchases 10,000 Nvidia H100 GPUs despite developing in-house Dojo training infrastructure.

Suggests Tesla's custom silicon faces delays or capability gaps, validating strong Nvidia demand even from companies building competing architectures.

Trade press · Slicast · August 30, 2023 · Global · Source: theregister.com

Tesla has revealed a substantial expansion of its AI infrastructure, deploying a 10,000 GPU compute cluster that came online Monday, as announced by Tesla AI Engineer Tim Zaman over the weekend. The system is designed to process data collected from Tesla vehicles and accelerate development of the company's full self-driving (FSD) capabilities, which Tesla has been promising since 2016. To date, the automaker has delivered only driver assistance features described as "super-cruise-control" that require human oversight, falling short of true autonomous driving. Tesla declined to provide further comment on the new deployment.

This latest investment follows Tesla's announcement last month of a $1 billion commitment to build out its Dojo supercomputer through the end of 2024 to speed autonomous driving software development. Dojo is built from Tesla's proprietary 15kW training tiles, which are grouped six to a tray; a full Dojo V1 system is rated at roughly one exaFLOPS (BF16). Each tile contains D1 chip dies designed by Tesla and manufactured by TSMC. CEO Elon Musk has explained why the company is pursuing both paths: "We'll actually take the hardware as fast as Nvidia will deliver it to us. If they could deliver us enough GPUs, we might not need Dojo, but they can't because they've got so many customers."

The new 10,000 GPU cluster represents a significant scale-up from Tesla's previous deployments. In 2021, Tesla operated a 720-node GPU cluster with eight A100 accelerators per node, totaling 5,760 GPUs and delivering a peak of 1.8 exaFLOPS of FP16 performance. The current system has nearly twice as many GPUs and uses Nvidia's latest H100 parts, which offer roughly three times the FP16 performance of their predecessors and add FP8 support. Built from HGX chassis with SXM5 H100 modules, the system comprises approximately 1,250 nodes with eight GPUs each, delivering a peak of 39.5 exaFLOPS of FP8 performance, a figure consistent with Nvidia's sparsity-enabled rating for the H100. The installation includes a hot tier cache capacity of more than 200 petabytes.
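
The cluster figures quoted above can be roughly reproduced with a few lines of arithmetic. The sketch below is only a sanity check. The per-GPU peak throughput values are assumptions taken from Nvidia's public datasheets (312 TFLOPS dense FP16 for the A100, 3,958 TFLOPS FP8 with sparsity for the H100 SXM5), not from the article, and the helper name `aggregate_exaflops` is purely illustrative.

```python
# Back-of-the-envelope check of the cluster figures quoted in the article.
# Per-GPU peak numbers are ASSUMPTIONS from Nvidia's public datasheets,
# not values stated in the article itself.

A100_FP16_TFLOPS = 312.0    # assumed: A100 dense FP16 Tensor Core peak
H100_FP8_TFLOPS = 3958.0    # assumed: H100 SXM5 FP8 peak with sparsity

TFLOPS_PER_EXAFLOPS = 1_000_000
GPUS_PER_NODE = 8


def aggregate_exaflops(gpu_count: int, tflops_per_gpu: float) -> float:
    """Theoretical peak of a cluster, ignoring interconnect and utilization."""
    return gpu_count * tflops_per_gpu / TFLOPS_PER_EXAFLOPS


# 2021 cluster: 720 nodes with eight A100s each.
old_gpus = 720 * GPUS_PER_NODE
old_ef = aggregate_exaflops(old_gpus, A100_FP16_TFLOPS)

# 2023 cluster: 10,000 H100s in eight-GPU HGX nodes.
new_gpus = 10_000
new_nodes = new_gpus // GPUS_PER_NODE
new_ef = aggregate_exaflops(new_gpus, H100_FP8_TFLOPS)

print(f"2021: {old_gpus} GPUs, about {old_ef:.2f} exaFLOPS FP16")
print(f"2023: {new_nodes} nodes, about {new_ef:.2f} exaFLOPS FP8")
print(f"GPU count ratio: {new_gpus / old_gpus:.2f}x")
```

Under these assumptions the 2021 cluster works out to about 1.80 exaFLOPS and the new one to about 39.58 exaFLOPS across 1,250 nodes, in line with the quoted 1.8 and 39.5. These are theoretical peaks at different precisions, so the two totals are not a like-for-like comparison of delivered training throughput.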

Critically, Tesla owns and operates this infrastructure on-premises at its own facilities rather than renting capacity from cloud providers like Microsoft or Google. Zaman emphasized this distinction: "Many orgs say 'We have' which usually means 'We rented' few actually own, and therefore fully vertically integrate. This bothers me because owning and maintaining is hard. Renting is easy." Tesla's vertical integration strategy appears to be extending to datacenter expansion; the company recently posted a job opening for a senior engineering program manager for datacenters tasked with leading "the end-to-end design and engineering of Tesla's first of its kind datacenters," suggesting plans for new facility construction to accommodate additional computational capacity.