Diffusion · AI Beat

23 Jun 2026 · AI Beat Desk

The Inpainting Model That Skipped the Attention

HUST's Moebius (0.22B) matches FLUX.1-Fill-Dev (11.9B) on six image inpainting benchmarks at 15× the inference speed. Two mechanisms make it work: Local-λ Mix Interaction blocks that replace quadratic spatial attention with fixed-size linear matrices, and adaptive multi-granularity latent-space distillation. For inpainting specifically, attention overhead appears to be the actual bottleneck — not parameter count. Weights are out.

12 Jun 2026 · AI Beat Desk

Text Diffusion Reaches Consumer Hardware

Google's DiffusionGemma 26B-A4B is a discrete text diffusion model that generates tokens in parallel blocks rather than left-to-right, hitting 1100+ tokens/sec on a single H100 and fitting in 18 GB of VRAM quantized. It's open under Apache 2.0 and marks the first time a production-quality diffusion LM from a major lab lands on consumer hardware — with real benchmark results showing what you trade away for that speed.

17 May 2026 · AI Beat Desk

One Minute of 720p World on One GPU

NVIDIA's SANA-WM generates 60-second, 720p video from a single image and a camera trajectory — on a single GPU. The open-source 2.6B-parameter model achieves 36× higher throughput than prior open-source world models and ships under Apache 2.0.

16 May 2026 · AI Beat Desk

The Draft Model You Don't Have to Train

Orthrus (arXiv 2605.12825) grafts a trainable diffusion head onto a frozen AR backbone, sharing the exact same KV cache. An intra-model consensus mechanism guarantees that every accepted token matches the AR distribution exactly — no approximation, no quality tradeoff — while achieving up to 7.8× speedup on Qwen3-8B with only O(1) memory overhead. The approach sidesteps the core operational cost of speculative decoding: maintaining a separate, carefully calibrated draft model.

15 Apr 2026 · AI Beat Desk

Diffusion LMs Finally Close the Quality Gap

A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.