The MLPerf Inference v6.0 results landed April 2, and the headline number is worth sitting with: NVIDIA achieved 250,634 tokens/second on the Interactive DeepSeek-R1 benchmark — up from roughly 90,000 tokens/second in the v5.1 submission six months ago. That’s a 2.77x improvement. On the same B200 hardware.
No new chips. No architectural change to the model. Just software.
What actually changed
Four techniques drove the improvement:
Disaggregated prefill/decode serving, implemented through NVIDIA’s Dynamo framework. Prefill and decode are fundamentally different compute workloads: prefill is compute-bound (processing the prompt), decode is memory-bandwidth-bound (generating tokens one at a time). Running them on the same GPU pool forces compromises in batching and scheduling. Disaggregation routes each stage to separately optimized GPU pools, letting each phase operate closer to its theoretical optimum.
Kernel fusion combines multiple CUDA kernels — attention, layernorm, FFN projections — that would previously execute sequentially into single larger kernels. This reduces kernel launch overhead and improves data locality; the fused output stays on-chip rather than round-tripping to HBM between operations.
Kernel overlapping pipelines execution so the GPU isn’t idle between kernels. While one operation is executing, the next is being prepared. This is particularly valuable for MoE models like DeepSeek-R1 where expert routing creates irregular execution patterns.
Multi-token prediction in TensorRT-LLM lets the model draft multiple tokens per forward pass, reducing the number of passes required per generated token. This trades a bit of per-pass compute for fewer passes total — a good deal when the bottleneck is memory bandwidth rather than compute.
Together these cut the effective token cost on DeepSeek-R1 to $0.30/M tokens, down substantially from the v5.1 baseline.
Why this matters
The narrative around AI infrastructure spending tends to focus on hardware — GPU counts, chip generations, cluster sizes. The MLPerf trajectory is a useful corrective. Software optimization has been compounding at a rate that’s easy to underestimate.
This isn’t new. The Chinchilla scaling laws showed that training compute allocation had been badly suboptimal for years. FlashAttention, continuous batching, and speculative decoding each delivered step-change improvements in inference efficiency after the fact. The pattern repeats: we build systems, deploy them, and then discover a significant fraction of available performance was being left on the table.
Disaggregated serving is probably the most consequential of the four techniques listed. It’s a genuine architectural change to how inference infrastructure is organized, not just a tuning improvement. Production systems that haven’t adopted it are likely running significantly below their hardware potential — and the gap is measurable. NVIDIA’s own Dynamo framework makes the pattern accessible; the integration work is real but the playbook is written.
The 2.77x number also puts hardware upgrade cycles in perspective. If the v6.0 improvement on B200 hardware exceeds what you’d get from switching to B200 from the previous generation in the first place, the ROI calculation on whether to upgrade hardware versus optimize software gets more complicated. For teams running serious inference workloads, that’s worth modeling explicitly rather than assuming new hardware always wins.
MLPerf benchmarks have their limits — they measure point-in-time throughput on specific workloads, not the full complexity of production serving. But as a rough gauge of where the optimization frontier currently sits, a 2.77x gain from software in six months is a meaningful signal.
