Chips & Hardware · Report

Nvidia's Blackwell B200 achieves 4X performance improvement over H100 in MLPerf benchmarks using FP4 precision.

Demonstrates significant generational leap in GPU performance, enabling denser AI workloads per infrastructure footprint.

Trade press · Slicast · August 28, 2024 · Global · Source: tomshardware.com


Nvidia has published the first MLPerf 4.1 results for its Blackwell B200 processor, demonstrating that a single Blackwell GPU delivers up to four times the performance of its H100 predecessor based on the Hopper architecture. According to Nvidia's results, the B200 achieves 10,755 tokens per second on a single GPU in a server inference test and 11,264 tokens per second in an offline reference test. Public MLPerf Llama 2 70B benchmark results show that a 4-way Hopper H100-based machine delivers similar performance, supporting Nvidia's claim that a single Blackwell processor is approximately 3.7X–4X faster than a single Hopper H100 GPU. However, several important caveats affect the validity of this comparison.

The performance gap between the two processors stems from several technical factors. Most significantly, Blackwell's fifth-generation Tensor Cores run the benchmark in FP4 precision, whereas the Hopper-based H100 goes no lower than FP8 and was measured in that format. MLPerf guidelines allow the two formats to be compared, but Blackwell's FP4 throughput is double its FP8 throughput, which gives it a built-in advantage. Additionally, Nvidia compared a single B200 against a machine with four H100 GPUs. That pairing is somewhat unfair: multi-GPU scaling is never perfect, so a single-GPU result tends to represent the best case for per-GPU performance.

The architectural differences extend to memory type and capacity. The tested B200 GPU carries 180GB of HBM3E memory, while the H100 SXM has 80GB of HBM (up to 96GB in some configurations) and the H200 ships with either 96GB of HBM3 or up to 141GB of HBM3E. A single H200 with 96GB of HBM3 achieved only 3,114 tokens per second in offline mode, whereas a single H200 with the larger memory configuration achieved 4,488 tokens per second. Measured against that faster H200 result, the B200's 11,264 tokens per second is only about 2.5X faster, a more modest gain than Nvidia's headline claim.
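The 2.5X figure can be checked directly from the offline-mode numbers quoted above. The short Python sketch below is illustrative only: the dictionary keys and the `speedup` helper are hypothetical names chosen for this example, and the inputs are just the three tokens-per-second figures reported in the text.

```python
# Offline-mode Llama 2 70B throughput quoted in the text
# (tokens per second, single GPU). Labels are illustrative.
offline_tokens_per_s = {
    "B200 (180GB HBM3E)": 11_264,
    "H200 (larger memory)": 4_488,
    "H200 (96GB HBM3)": 3_114,
}


def speedup(new: float, baseline: float) -> float:
    """Return how many times faster `new` is than `baseline`."""
    return new / baseline


b200 = offline_tokens_per_s["B200 (180GB HBM3E)"]

# Ratio of the B200 result to each H200 result, rounded to two decimals.
ratios = {
    name: round(speedup(b200, tps), 2)
    for name, tps in offline_tokens_per_s.items()
    if not name.startswith("B200")
}

for name, ratio in ratios.items():
    print(f"B200 vs {name}: {ratio:.2f}x")
```

Against the faster H200 result the ratio comes out at roughly 2.5X, matching the figure in the text; against the 96GB H200 it is closer to 3.6X, which shows how much the choice of baseline moves the headline number.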

The H200 with 141GB of HBM3E memory performed exceptionally well not only in the generative AI benchmark featuring Llama 2 70B, but also in every single test within the datacenter category, benefiting significantly from its greater memory capacity in tests that leverage that advantage. Notably, Nvidia has only shared B200 performance data for the Llama 2 70B generative AI benchmark within MLPerf 4.1, which contains nine core disciplines total. Whether this limited disclosure reflects ongoing tuning efforts or other factors remains unclear, but the full capabilities of the Blackwell B200 across the other eight test categories remain unknown.
