A Trillion Parameters at a Thousand Tokens Per Second

Xiaomi and TileRT published MiMo-V2.5-Pro-UltraSpeed on June 8, pushing a one-trillion-parameter model past 1000 tokens per second on a single standard 8-GPU node — no custom silicon, just three carefully chosen co-design decisions applied to a commodity cluster.

Read more →

Speculative Decoding Has an Acceptance Problem You Can Exploit

Mistletoe (arXiv 2605.14005) demonstrates a stealthy adversarial attack on speculative decoding systems: craft inputs that look normal to the target model but cause the draft model to disagree, collapsing acceptance length and throughput while leaving output quality and perplexity unchanged. The attack exploits the fundamental gap between draft and target distributions that all speculative systems rely on bridging.

Read more →

The Draft Model You Don't Have to Train

Orthrus (arXiv 2605.12825) grafts a trainable diffusion head onto a frozen AR backbone, sharing the exact same KV cache. An intra-model consensus mechanism guarantees that every accepted token matches the AR distribution exactly — no approximation, no quality tradeoff — while achieving up to 7.8× speedup on Qwen3-8B with only O(1) memory overhead. The approach sidesteps the core operational cost of speculative decoding: maintaining a separate, carefully calibrated draft model.

Read more →

Gemma 4 Gets Speculative Decoding That Ships

Google ships multi-token prediction draft models for the full Gemma 4 family under Apache 2.0, reporting up to 3x throughput gains. The architecture is tightly coupled — shared embeddings, last-layer activations — which keeps the drafter accurate but limits reuse. MoE variants complicate the picture.

Read more →