Vision · AI Beat

23 Jun 2026 · AI Beat Desk

The Inpainting Model That Skipped the Attention

HUST's Moebius (0.22B parameters) matches FLUX.1-Fill-Dev (11.9B parameters) on six image inpainting benchmarks while running 15× faster at inference. Two mechanisms make it work: Local-λ Mix Interaction blocks, which replace quadratic spatial attention with fixed-size linear matrices, and adaptive multi-granularity latent-space distillation. For inpainting specifically, attention overhead, not parameter count, appears to be the actual bottleneck. The weights are out.
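To make the "quadratic attention versus fixed-size linear matrices" contrast concrete, here is a minimal NumPy sketch. It is a generic kernelized linear-attention example, not the Local-λ Mix Interaction block itself, whose details are not given here. The function names and the ELU+1 feature map are assumptions chosen for illustration.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention: builds an (n, n) score matrix, so cost grows with n^2."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    """Illustrative linear attention: context is summarized in a (d, d) matrix
    whose size does not depend on the number of tokens n."""
    def phi(x):
        # ELU(x) + 1, an assumed positive feature map
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    qf, kf = phi(q), phi(k)
    context = kf.T @ v            # (d, d): fixed size regardless of n
    norm = kf.sum(axis=0)         # (d,): normalizer shared by all queries
    return (qf @ context) / (qf @ norm)[:, None]

# Hypothetical sizes: 64 spatial tokens, 8 channels.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
out = linear_attention(q, k, v)
```

The linear variant never materializes the token-by-token matrix, so doubling the number of spatial positions doubles its cost rather than quadrupling it. That scaling difference is the kind of saving the teaser attributes to dropping quadratic attention.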

14 May 2026 · AI Beat Desk

Dropping the Encoder

SenseTime has open-sourced SenseNova-U1, a unified multimodal model that removes both the visual encoder and the VAE, the two architectural crutches that major multimodal systems have relied on since the CLIP era. The NEO-unify architecture processes pixels natively through a shared transformer backbone, with a direct pixel-space MLP head for generation. Benchmarks on image generation and interleaved content put it at or above current open-source leaders; the spatial reasoning numbers are the most credible differentiator.

03 May 2026 · AI Beat Desk

Drop the Encoder: Meta's Tuna-2 Goes Straight to Pixels

Meta AI's Tuna-2 paper shows that a 7B unified multimodal model trained end-to-end on raw pixel patches — with no pretrained vision encoder — matches or beats its CLIP-based sibling at scale, particularly on fine-grained perception tasks. The result challenges a design assumption that has been stable in multimodal modeling for years.

24 Apr 2026 · AI Beat Desk

Generation Is Pretraining, in Vision Too

Google DeepMind's Vision Banana paper shows that training a model to generate images — and only that — produces transferable visual representations strong enough to beat specialized discriminative models on segmentation and metric depth estimation when lightly instruction-tuned. The finding is the visual analog of how LLM pretraining generalizes across language tasks.

03 Apr 2026 · AI Beat Desk

Microsoft Starts Building Its Own

Microsoft released three foundational AI models through Azure AI Foundry on April 2: MAI-Transcribe-1 for speech, MAI-Voice-1 for synthesis, and MAI-Image-2 for generation. These are Microsoft's first internally built foundational models — a quiet but significant signal that the company wants more control over its AI stack than the OpenAI partnership alone provides.