Speculative decoding has become the default inference trick in production LLM serving. The recipe is well-understood: run a small draft model to propose a block of tokens, verify them with the target model in a single forward pass, accept what matches, and correct where it doesn’t. Done well, you get near-linear throughput gains on output-heavy workloads. Done poorly, you get a draft model that needs retraining every time the target model is updated, a separate model to load into memory, and acceptance rates that fall apart on distribution shifts.
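
As a minimal sketch of the draft-and-verify recipe in the greedy case, the step below proposes `k` tokens, checks them against the target, and commits the longest agreeing run plus one corrected or bonus token. The `GreedyModel` interface and the two toy models are hypothetical stand-ins, not any particular serving stack's API, and the verification loop stands in for what would be a single batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
GreedyModel = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: GreedyModel,
                     target: GreedyModel, k: int) -> List[int]:
    """Return the tokens committed by one draft-and-verify step (greedy case)."""
    # 1. The draft proposes k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target checks each position. A real system scores all k positions
    #    in one batched forward pass; the loop here is only for readability.
    committed: List[int] = []
    for i in range(k):
        expected = target(prefix + proposal[:i])
        committed.append(expected)
        if proposal[i] != expected:
            return committed  # first mismatch: keep the target's token and stop
    # 3. Every proposal matched, so the target's next token comes for free.
    committed.append(target(prefix + proposal))
    return committed


# Toy models over digits 0-9: the target counts upward, the draft agrees
# except right after a 4, where it wrongly guesses 0.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=4))  # [2, 3, 4, 5]
```

Whatever the draft proposes, the committed tokens are always the ones the target would have produced on its own; the draft only changes how many of them arrive per step.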

Orthrus, posted to arXiv this week, asks whether the drafter needs to be a separate model at all.

Two Heads, One Cache

The core idea is that a frozen AR backbone already contains all the contextual representations you need to propose parallel tokens: they’re sitting in the KV cache. Orthrus augments the backbone with a lightweight diffusion head (three new projection matrices: $W_Q^{\text{diff}}$, $W_K^{\text{diff}}$, $W_V^{\text{diff}}$) that attends to the same KV representations as the AR head, initialized from the AR counterparts. The diffusion head takes an anchor token plus $K-1$ masked tokens and produces $K$ candidates in a single forward pass. These candidates are then routed through the frozen AR head to get exact causal probabilities in a second pass.
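
To make the shared-cache idea concrete, here is a rough single-head, single-layer sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: the function names, the tiny dimensions, and the choice to let the block positions attend to each other bidirectionally are all hypothetical. What it does show is the part the text describes: the diffusion projections start as copies of the AR ones, and the cached prefix keys and values are reused rather than recomputed.

```python
import copy
import math
from typing import List

Vec = List[float]
Mat = List[Vec]


def matvec(W: Mat, x: Vec) -> Vec:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attend(q: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def diffusion_block(block_inputs: List[Vec], cache_k: List[Vec], cache_v: List[Vec],
                    W_q: Mat, W_k: Mat, W_v: Mat) -> List[Vec]:
    """One hypothetical attention layer of the diffusion head.

    block_inputs holds the anchor embedding followed by K-1 mask embeddings.
    Assumption: block positions see the cached prefix and each other.
    """
    block_k = [matvec(W_k, x) for x in block_inputs]
    block_v = [matvec(W_v, x) for x in block_inputs]
    keys = cache_k + block_k      # new lists: the AR cache itself is not modified
    values = cache_v + block_v
    return [attend(matvec(W_q, x), keys, values) for x in block_inputs]


# Toy AR projections (identity) and a two-token prefix already in the cache.
W_q_ar = [[1.0, 0.0], [0.0, 1.0]]
W_k_ar = copy.deepcopy(W_q_ar)
W_v_ar = copy.deepcopy(W_q_ar)
cache_k = [[1.0, 0.0], [0.0, 1.0]]
cache_v = [[1.0, 0.0], [0.0, 1.0]]

# The diffusion projections are initialized from their AR counterparts.
W_q_diff, W_k_diff, W_v_diff = (copy.deepcopy(W) for W in (W_q_ar, W_k_ar, W_v_ar))

anchor, mask = [1.0, 0.0], [0.0, 0.0]
out = diffusion_block([anchor, mask, mask], cache_k, cache_v,
                      W_q_diff, W_k_diff, W_v_diff)  # K = 3 positions in one call
```

All three positions are computed in one call over the same cached prefix, which is why no second KV cache is needed in this picture.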

The memory cost of this is constant: ~4.5 MiB regardless of sequence length. There’s no second model’s KV cache to maintain, no separate process, no model registry to keep in sync. Trainable parameters add up to about 16% of the backbone — and training requires less than 1 billion tokens on a single 8×H200 node.

The Lossless Guarantee

What distinguishes Orthrus from diffusion-language-model approaches like MDLM or Fast-dLLM-v2 is the consensus mechanism. After the diffusion head proposes $K$ tokens, each candidate is validated against the AR greedy prediction. The system commits accepted tokens and corrects the first mismatch with the exact AR output. For non-greedy (temperature > 0) generation, rejection sampling is applied to ensure strict alignment with the target distribution. The result: no degradation on the reported benchmarks. MATH-500 accuracy is identical to the baseline, GSM8K is identical, HumanEval is identical. This holds by construction rather than by measurement alone: the output distribution is mathematically identical to the backbone’s, because the AR path is the arbiter, not an approximation.
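
For the temperature > 0 case, the sketch below shows the standard speculative-sampling acceptance rule from the general literature: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(p − q, 0). Whether Orthrus uses exactly this rule is an assumption here; the function name and the toy distributions are illustrative only.

```python
import random
from typing import List, Tuple


def accept_or_resample(token: int, p_target: List[float], q_draft: List[float],
                       rng: random.Random) -> Tuple[bool, int]:
    """Return (accepted, committed_token) for one drafted token.

    p_target and q_draft are the target and draft probabilities over the
    vocabulary at this position; `token` was sampled from q_draft.
    """
    p, q = p_target[token], q_draft[token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: draw a replacement from the part of p that q under-covers.
    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        residual = list(p_target)  # guard: p and q coincide, sample from p
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, replacement


# Toy check: committed tokens should follow p_target even though drafts come from q_draft.
rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]
q_draft = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
trials = 20000
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q_draft, k=1)[0]
    _, committed = accept_or_resample(drafted, p_target, q_draft, rng)
    counts[committed] += 1
frequencies = [c / trials for c in counts]
```

In the toy check the empirical frequencies land close to `p_target` rather than `q_draft`, which is the property that makes this kind of scheme lossless in distribution.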

This matters in practice. Existing speculative decoding systems with separate draft models get acceptance rates of perhaps 70–90% in the best case, and those rates drop when the model is fine-tuned or when the input distribution drifts. A deployment running EAGLE-3 against a base model has to retrain or re-calibrate the drafter when the backbone changes. Orthrus doesn’t have this problem: the diffusion head was trained to agree with its host, and the consensus mechanism guarantees agreement at inference time regardless.

Numbers

On Qwen3-8B, Orthrus achieves an average of 5.39 tokens per forward pass (plain AR decoding yields exactly 1.0) and a mean 5.36× wall-clock speedup across coding, math, and reasoning tasks. The peak speedup of 7.8× occurs on tasks where the model tends toward repetitive or highly predictable outputs, exactly the case where long acceptance runs are most achievable. Compared to EAGLE-3 and DFlash, two strong speculative decoding baselines, Orthrus matches or exceeds acceptance lengths while eliminating the separate model overhead.

The authors test across Qwen3-1.7B, 4B, and 8B to validate that the approach scales within the model family. Results hold across sizes.

What This Doesn’t Solve

Orthrus requires two forward passes per generation step: one for the diffusion head to propose, one for the AR head to verify. The net speedup relies on these two passes being faster than the $K$ sequential AR passes they replace. That holds when decoding is memory-bandwidth-bound (the common case on modern hardware), but the arithmetic changes on compute-bound workloads or at very small $K$. The paper doesn’t characterize failure modes on compute-bound configurations.
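
A back-of-the-envelope cost model makes the trade-off easier to see. This is a simplification, not the paper's analysis: the function and its parameters are hypothetical, and the timings in the example are made-up units chosen only to contrast the two regimes.

```python
def estimated_speedup(tokens_per_step: float, t_ar_pass: float,
                      t_propose_pass: float, t_verify_pass: float) -> float:
    """Ratio of plain AR time to propose-then-verify time for the same tokens.

    tokens_per_step: average tokens committed per propose+verify step.
    t_*: wall-clock cost of each kind of forward pass, in the same units.
    """
    baseline = tokens_per_step * t_ar_pass   # one AR pass per token
    step = t_propose_pass + t_verify_pass    # two passes per step
    return baseline / step


# Memory-bound regime (illustrative): a pass over a block of tokens costs
# about the same as a pass over one token, since weight loading dominates.
memory_bound = estimated_speedup(4.0, t_ar_pass=1.0,
                                 t_propose_pass=1.0, t_verify_pass=1.0)

# Compute-bound regime (illustrative): verifying a block costs roughly in
# proportion to its length, which can erase the gain entirely.
compute_bound = estimated_speedup(4.0, t_ar_pass=1.0,
                                  t_propose_pass=1.0, t_verify_pass=7.0)

print(memory_bound, compute_bound)  # 2.0 0.5
```

The same acceptance length yields a speedup in one regime and a slowdown in the other, which is why the missing compute-bound characterization matters.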

Training the diffusion head also requires access to the backbone’s internals at a fairly low level — this isn’t a pure inference-time trick you can apply to a model you’re querying via API. It’s a modification requiring model weights.

None of this undermines the core result. The implicit cost of keeping a separate draft model calibrated, versioned, and in sync with the target model is real — every team operating a production serving stack with speculative decoding knows this. Orthrus’s bet is that folding the draft into the backbone, at the cost of a modest training run, beats the ongoing operational overhead. Based on the numbers, it’s a reasonable bet.

The code and Qwen3-8B checkpoint are on GitHub.