
Nvidia released Dynamo, an open source inference framework that the company describes as an operating system for AI infrastructure.

The release extends Nvidia's vertical integration into the software stack and strengthens its competitive moat in AI infrastructure.

Trade press · Slicast · March 23, 2025 · Global · Source: theregister.com


Nvidia announced Dynamo, an open source inference suite, at its GPU Technology Conference this week, with CEO Jensen Huang describing it as the "operating system of an AI factory." Drawing comparisons to the historical dynamo that sparked an industrial revolution, Huang explained the metaphor: "The dynamo was the first instrument that started the last industrial revolution. The industrial revolution of energy — water comes in, electricity comes out." The framework is designed to optimize inference engines such as TensorRT LLM, SGLang, and vLLM to run across large quantities of GPUs as quickly and efficiently as possible.

LLM inference performance breaks into two phases: prefill and decode. Prefill speed is determined by how quickly the GPU's floating-point matrix math accelerators can process the input prompt, so longer prompts take correspondingly longer. Decode speed is how quickly GPUs produce output tokens in response to that prompt. Decode is typically limited by memory bandwidth: all else being equal, a GPU with 8TB/s of memory bandwidth will generate tokens more than twice as fast as one with 3.35TB/s. The challenge intensifies when serving larger models to more users with longer input and output sequences, because large models are typically distributed across multiple GPUs, and how they are distributed significantly affects performance and throughput.
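The bandwidth comparison above can be checked with a back-of-envelope calculation. The sketch below assumes decode is purely memory-bound, so every generated token requires reading all model weights once and the ceiling on tokens per second for a single request is bandwidth divided by model size. The 70 GB model size (roughly a 70-billion-parameter model at 8 bits per parameter) and the function name are illustrative assumptions, not figures from Nvidia.

```python
def decode_tokens_per_second(bandwidth_tb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on single-request decode speed.

    Assumes decode is purely memory-bound: every generated token
    requires streaming all model weights from memory once.
    """
    bandwidth_gb_s = bandwidth_tb_s * 1000  # TB/s -> GB/s
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 70.0  # hypothetical: ~70B parameters at 8 bits each

fast = decode_tokens_per_second(8.0, MODEL_SIZE_GB)
slow = decode_tokens_per_second(3.35, MODEL_SIZE_GB)

print(f"8 TB/s:    ~{fast:.0f} tokens/s ceiling")
print(f"3.35 TB/s: ~{slow:.0f} tokens/s ceiling")
print(f"ratio:     {fast / slow:.2f}x")
```

Under this simplified model the ratio between the two GPUs is about 2.39 regardless of the model size chosen, which matches the "more than twice as fast" claim. Real systems fall short of these ceilings because of KV cache reads, batching, and interconnect overhead.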

The distribution approach determines where a system falls on what Huang described as a performance curve. "Under the Pareto frontier are millions of points we could have configured the datacenter to do," he said. "We could have parallelized and split the work and sharded the work in a whole lot of different ways." One configuration might serve millions of concurrent users at only 10 tokens per second each, while another serves only a few thousand concurrent requests but generates hundreds of tokens for each almost instantly. Finding the best balance between per-user performance and total throughput is a key capability Dynamo provides.
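The Pareto frontier Huang refers to can be illustrated with a small sketch: out of many candidate configurations, keep only those that no other configuration beats on both per-user speed and number of concurrent users. The `Config` type, the configuration names, and all the numbers below are made up for illustration; this is not how Dynamo itself searches the configuration space.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    tokens_per_sec_per_user: float  # interactivity
    concurrent_users: int           # capacity


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configs that no other config dominates on both axes."""
    frontier = []
    for c in configs:
        dominated = any(
            o.tokens_per_sec_per_user >= c.tokens_per_sec_per_user
            and o.concurrent_users >= c.concurrent_users
            and (o.tokens_per_sec_per_user > c.tokens_per_sec_per_user
                 or o.concurrent_users > c.concurrent_users)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.tokens_per_sec_per_user)


# Hypothetical configurations; the numbers are invented for illustration.
configs = [
    Config("max-throughput", 10, 1_000_000),
    Config("balanced", 60, 200_000),
    Config("wasteful", 40, 150_000),  # worse than "balanced" on both axes
    Config("max-interactivity", 300, 3_000),
]

for c in pareto_frontier(configs):
    print(c.name, c.tokens_per_sec_per_user, c.concurrent_users)
```

Here "wasteful" falls under the frontier because "balanced" is better on both axes, while the other three each represent a different, defensible trade-off.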

Dynamo includes a GPU planner that determines, based on demand, how many accelerators should be dedicated to prefill and how many to decode, with the two phases disaggregated onto different accelerators. The framework also features prompt routing that directs requests with overlapping prompts to the same GPU groups to maximize key-value (KV) cache hits, a low-latency communication library for GPU-to-GPU data flows, and a memory management subsystem that moves KV cache data between HBM, system memory, and cold storage to maximize responsiveness. For Hopper-based systems running Llama models, Nvidia claims Dynamo effectively doubles inference performance; for larger Blackwell NVL72 systems, the company claims a 30x advantage over Hopper when running DeepSeek-R1 with the framework enabled.
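The idea behind routing for KV cache hits can be sketched as a toy router that sends each prompt to the worker whose previously served prompts share the longest token prefix with it. Everything here is an assumption for illustration: the `PrefixRouter` class, the use of plain integer token IDs, and the tie-breaking by number of cached prompts are hypothetical and do not reflect Dynamo's actual router or API.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixRouter:
    """Toy router: send each prompt to the worker with the longest cached prefix."""

    def __init__(self, num_workers: int) -> None:
        # Each worker remembers the token sequences it has already served.
        self.cached: list[list[list[int]]] = [[] for _ in range(num_workers)]

    def route(self, prompt: list[int]) -> int:
        def best_overlap(worker: int) -> int:
            return max(
                (shared_prefix_len(prompt, seq) for seq in self.cached[worker]),
                default=0,
            )

        # Prefer the largest overlap; break ties toward the worker that has
        # served the fewest prompts (a crude stand-in for load).
        worker = max(
            range(len(self.cached)),
            key=lambda w: (best_overlap(w), -len(self.cached[w])),
        )
        self.cached[worker].append(prompt)
        return worker


router = PrefixRouter(num_workers=2)
system_prompt = [1, 2, 3, 4]  # hypothetical shared system-prompt tokens

first = router.route(system_prompt + [10, 11])
other = router.route([7, 8, 9])
repeat = router.route(system_prompt + [20])
print(first, other, repeat)
```

The third request lands on the same worker as the first because they share the system-prompt prefix, so that worker could reuse the KV cache entries already computed for those tokens. A production router would also have to weigh cache eviction and actual GPU load.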

Though optimized for Nvidia's hardware and software stacks, Dynamo is designed to integrate with popular serving libraries including vLLM, PyTorch, and SGLang, allowing users in heterogeneous compute environments with AMD or Intel accelerators to continue using their existing inference engines. The framework runs on any Nvidia GPU going back to Ampere, so users still operating A100s can benefit from the new software. Nvidia has published instructions for Dynamo on GitHub and will also offer the framework as a container image.
