Data Centers · Report

Meta disclosed its Catalina AI pod architecture coupling Nvidia Blackwell GB200 NVL72 GPUs with Open Rack v3 and liquid cooling.

Hyperscaler deployment details validate Blackwell's architectural fit for liquid-cooled dense clusters, setting industry design patterns for power-constrained datacenters.

Trade press · Slicast · August 25, 2025 · Global · Source: wccftech.com


Meta has shared the building blocks of its Catalina AI system, which is based on NVIDIA's GB200 NVL72 solution with Open Rack v3 and liquid cooling. Meta started the Catalina project early with NVIDIA, using the NVL72 GPU solution as the baseline, although Catalina itself uses an NVL36x2 configuration. Meta worked with NVIDIA to customize the system to its needs, and both companies contributed the reference designs for MGX and NVL72 to open source. Catalina is available on the Open Compute website.

Meta's GPU cluster requirements have grown dramatically over time. In 2022, Meta focused mainly on clusters of around 6,000 GPUs, designed for traditional ranking and recommendation models whose workloads spanned 128 to 512 GPUs. A year later, driven by GenAI and LLMs, clusters grew to 16,000 to 24,000 GPUs, a 4x increase at the upper end. Last year, Meta was running 100,000 GPUs, and it continues to add more. As a software enabler with models such as Llama, Meta anticipates a 10x increase in cluster sizes over the next few years.

In Catalina, Meta calls each system a "pod" and replicates it to scale. Each pod consists of two identically configured IT racks that together form a single 72-GPU scale-up domain, in line with the NVL36x2 configuration. Each IT rack has 18 compute trays split between top and bottom, with nine NV switches in each rack. Large air-assisted liquid cooling devices (ALCs) on the left and right of the racks allow Meta to deploy liquid-cooled, high-power-density racks in existing data centers throughout the US and worldwide.
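The pod layout described above can be written down as a small data model. This is only an illustrative sketch: the class and function names are hypothetical, the 36-GPUs-per-rack figure is a reading of "NVL36x2" rather than a number stated in the report, and the pod count for a 100,000-GPU fleet is plain arithmetic, not a statement about how Meta's existing clusters are built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rack:
    """One Catalina IT rack, using the counts given in the report."""
    compute_trays: int = 18
    nvswitch_trays: int = 9
    gpus: int = 36  # assumption: NVL36x2 read as 36 GPUs per rack

    @property
    def gpus_per_tray(self) -> float:
        # Derived value, not stated in the report.
        return self.gpus / self.compute_trays


@dataclass(frozen=True)
class Pod:
    """A pod is two identically configured IT racks."""
    racks: tuple = (Rack(), Rack())

    @property
    def gpus(self) -> int:
        return sum(rack.gpus for rack in self.racks)

    @property
    def compute_trays(self) -> int:
        return sum(rack.compute_trays for rack in self.racks)


def pods_needed(target_gpus: int, pod: Pod = Pod()) -> int:
    """Smallest number of pods that provides at least target_gpus GPUs."""
    return -(-target_gpus // pod.gpus)  # ceiling division


pod = Pod()
print(pod.gpus, pod.compute_trays, pods_needed(100_000))
```

Under these assumptions a pod holds 72 GPUs across 36 compute trays, which works out to two GPUs per tray.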

Meta's two-rack configuration increases the number of CPUs and the total memory, raising LPDDR memory from 17 TB to 34 TB and enabling 48 TB of total cache-coherent memory shared between GPUs and CPUs. The PSU converts 480 volts or 277 volts single-phase to 48 volts DC, which is distributed through the busbar to power individual server blades, NV switches, and networking devices. Meta's high-power version of Open Rack v3 allows up to 94 kW for the busbar (600A), supporting newer buildings with facility liquid cooling. The Rack Management Controller (RMC) constantly monitors rack components for leaks and manages the air-assisted liquid cooling systems and facility-level valve trains.
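The memory figures can be sanity-checked with simple arithmetic. The sketch below assumes that the 48 TB coherent total is just CPU-attached LPDDR plus GPU memory, and that terabytes are decimal; the report does not itemize the total, so the GPU-side numbers are implied values rather than quoted ones. The constant names are hypothetical.

```python
# Figures quoted in the report.
LPDDR_TB_BASELINE = 17   # LPDDR before the two-rack configuration
LPDDR_TB_CATALINA = 34   # LPDDR in Meta's configuration
COHERENT_TB_TOTAL = 48   # total cache-coherent memory (CPU + GPU)
GPUS_PER_DOMAIN = 72     # GPUs in one scale-up domain

# Assumption: total coherent memory = LPDDR + GPU memory, nothing else.
implied_gpu_memory_tb = COHERENT_TB_TOTAL - LPDDR_TB_CATALINA

# Assumption: decimal units (1 TB = 1000 GB), memory spread evenly over GPUs.
implied_gb_per_gpu = implied_gpu_memory_tb * 1000 / GPUS_PER_DOMAIN

# How much the CPU-attached memory grew relative to the baseline.
lpddr_growth = LPDDR_TB_CATALINA / LPDDR_TB_BASELINE

print(f"implied GPU memory: {implied_gpu_memory_tb} TB")
print(f"implied memory per GPU: {implied_gb_per_gpu:.0f} GB")
print(f"LPDDR growth: {lpddr_growth:.1f}x")
```

Under these assumptions the GPU side accounts for about 14 TB of the coherent pool, and the LPDDR capacity doubles, which matches the report's statement that the configuration increases the CPU count.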

To connect multiple pods into larger clusters, Meta uses its own disaggregated scheduled fabric. It links pods within a single data center building or suite, across multiple buildings, and potentially at larger scales still. The fabric is tuned for AI, provides flexibility and speed, and in effect lets all the GPUs across the entire system communicate with each other.
