NVIDIA secured its sixth consecutive first-place finish in MLPerf AI inference benchmarks.
The Hopper-based H100 GPU delivers substantial gains over prior generations, achieving up to four times the performance of the NVIDIA A100 on the newly released MLPerf Inference v2.1 benchmark suite. On Natural Language Processing specifically, the H100's Transformer Engine and software delivered better than four times the performance of the A100, handily beating the Chinese-built Biren BR104 GPU. Consistent with previous NVIDIA generational transitions, Hopper delivers 1.5x to 2.5x the performance of the Ampere-based A100 on most AI benchmarks. At the edge, the NVIDIA Jetson AGX Orin leads. With this release NVIDIA captured performance leadership in inference processing, after losing the absolute performance crown in the last training round while the H100 was still on its way.
NVIDIA's mature software stack proved decisive, enabling the company to submit results for all 7 inference benchmarks, while competitors cherry-pick the one or two benchmarks they have been tuning for customer projects. This breadth matters to cloud service providers because multi-model inference is quickly becoming critical: a simple user query such as "what kind of flower is this?" can decompose into nine AI models working together, with the answer expected within 10 milliseconds. A processor that can run a large number of workloads interchangeably greatly simplifies IT management, reduces capital expenses, and dramatically improves utilization rates, increasing profits for cloud service providers. The NVIDIA Jetson AGX Orin demonstrated a 50% performance improvement which, at unchanged power draw, translates directly into a 50% improvement in power efficiency.
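Power efficiency here means throughput per watt. As a worked sketch of why a 50% performance gain maps one-to-one onto a 50% efficiency gain, assume (as the claim implicitly does) that power draw $P$ is the same in both runs; the symbols $T$ for throughput and $E$ for efficiency are introduced for illustration only and do not come from the MLPerf results.

```latex
E = \frac{T}{P}, \qquad
\frac{E_{\text{new}}}{E_{\text{old}}}
  = \frac{T_{\text{new}} / P}{T_{\text{old}} / P}
  = \frac{1.5\,T_{\text{old}}}{T_{\text{old}}}
  = 1.5
```

If the newer run had instead drawn more power, say a hypothetical $1.2P$, the efficiency gain would shrink to $1.5 / 1.2 = 1.25$, i.e. only 25%. The same kind of arithmetic shows why the flower query is demanding: if its nine models ran strictly one after another (an assumption; the pipeline's structure is not described), each would have on average only $10/9 \approx 1.1$ ms.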
The benchmarking round also included notable new submissions. The SAPEON X220-Enterprise from SK Telecom and the Biren BR104 GPU from China both appeared for the first time, the latter showing impressive image-processing performance with just 32GB of HBM2E memory. However, several major vendors, including AMD, AWS, Groq, SambaNova, and Tenstorrent, remain absent from MLPerf submissions, in contrast to Intel, Google, and Graphcore, which have frequently submitted training results. The author notes that the absent vendors clearly run these models internally for sales efforts and performance optimization, and calls for greater openness and transparency from the ML community.
On power efficiency, the Qualcomm Cloud AI 100 bested all submissions, albeit only for image processing and NLP (BERT), showing that while NVIDIA wins the performance race, Qualcomm wins the power efficiency battle. Recent adoption of the Cloud AI 100 by server vendors including Dell, HPE, and Lenovo indicates significant customer interest in the Qualcomm platform for edge deployments. The author suspects that many absent vendors follow the principle of not releasing benchmarks they did not win, which suggests their chips may not be as performant as their marketing claims, or as competitive with NVIDIA and Qualcomm offerings.