Text Diffusion Reaches Consumer Hardware

Autoregressive language models have a fundamental hardware problem: they generate one token at a time, and each step requires loading the full set of model weights through GPU memory. That makes generation memory-bandwidth-bound rather than compute-bound, leaving most of the GPU’s arithmetic capacity idle. Speculative decoding, parallel verification schemes, and MoE sparsity are all attempts to claw some of that idle compute back. DiffusionGemma, released by Google DeepMind on June 10, takes a different route: change the generation algorithm entirely.

How it works

Rather than predicting tokens left-to-right, DiffusionGemma starts with a canvas of placeholder tokens and iteratively denoises them in parallel. This is the discrete diffusion approach applied to text — conceptually similar to how image diffusion models like Stable Diffusion start with noise and refine it, but adapted for the discrete token space that language models inhabit. The key architectural payoff is bidirectional attention: during denoising, every token can attend to every other token on the canvas simultaneously. Autoregressive models can only attend leftward (to tokens already committed), which constrains certain kinds of reasoning. A task like filling in a constrained template, writing a Sudoku solution, or generating structured data where the end affects the middle is structurally easier with bidirectional access.

For sequences longer than a single generation block, DiffusionGemma uses a block-autoregressive scheme: once a block is fully denoised and committed, it goes into the KV cache and the next block starts. This is a pragmatic compromise — pure parallel generation across arbitrarily long sequences would require keeping the whole context in a noisy, uncommitted state.

The model itself is built on the Gemma 4 26B-A4B MoE backbone: 25.2B total parameters with 3.8B active per token (8 of 128 experts routed per inference step). Quantized, it fits within 18 GB of VRAM — within reach of an RTX 4090 or RTX 5090.

The speed numbers

On a single NVIDIA H100, DiffusionGemma hits over 1100 tokens per second. On an RTX 5090 it reaches 700+ tokens/sec. Compared to its autoregressive counterpart, that’s roughly a 4× throughput improvement. The reason this works is that the diffusion process converts the memory-bandwidth bottleneck into a compute-bound problem — the GPU’s tensor cores have actual work to do across the full token canvas at each denoising step.

The quality tradeoff

The benchmark numbers are honest and Google didn’t try to hide them. Against autoregressive Gemma 4 26B-A4B:

Benchmark	DiffusionGemma	Gemma 4
MMLU Pro	77.6%	82.6%
AIME 2026	69.1%	88.3%
MMMU Pro (vision)	54.3%	73.8%
Codeforces ELO	1429	1718

The gap is real and substantial — nearly 20 points on AIME and almost 300 Codeforces ELO points. For hard reasoning and competitive programming, DiffusionGemma is considerably weaker. Google’s own model card says: for maximum quality, use autoregressive Gemma 4. This model is for latency-sensitive applications where you can accept lower peak accuracy in exchange for raw throughput.

What it might actually be good for is the class of tasks where bidirectional attention matters more than sequential coherence: structured generation, constraint satisfaction, and scenarios where you need a fast first draft and can verify or filter output externally. The Sudoku fine-tuning result — 80% success after reducing inference steps from 48 to 12 — illustrates this: the model’s global context awareness made it easier to specialize for a constraint-heavy task.

Why this matters

Text diffusion has been in research papers for a while but has been slower to arrive in practice than image diffusion. The architecture requires rethinking how generation works at a fundamental level, and the quality gap versus autoregressive models has been a persistent obstacle. DiffusionGemma doesn’t close that gap — but it puts an open-weights, Apache 2.0 licensed, consumer-GPU-runnable text diffusion model in front of anyone who wants to experiment with it. The underlying approach (and the bidirectional attention advantage for constrained generation) is interesting enough that research built on top of it will follow quickly.

Whether discrete diffusion ever closes the quality gap enough to challenge autoregressive models at frontier capability levels remains an open question. What’s clear from the benchmark numbers is that we’re not there yet. But having a real production model with documented performance numbers and open weights is a better starting point for that answer than another research paper alone.