Dropping the Encoder

SenseTime has open-sourced SenseNova-U1, a unified multimodal model that removes both the visual encoder and the VAE, the two architectural crutches that every major multimodal system has relied on since the CLIP era. Its NEO-unify architecture processes pixels natively through a shared transformer backbone, with a direct pixel-space MLP head for generation. Benchmarks on image generation and interleaved content put it at or above current open-source leaders, and the spatial reasoning numbers are the most credible differentiator.

Read more →

Drop the Encoder: Meta's Tuna-2 Goes Straight to Pixels

Meta AI's Tuna-2 paper shows that a 7B unified multimodal model trained end-to-end on raw pixel patches — with no pretrained vision encoder — matches or beats its CLIP-based sibling at scale, particularly on fine-grained perception tasks. The result challenges a design assumption that has been stable in multimodal modeling for years.

Read more →

Generation Is Pretraining, in Vision Too

Google DeepMind's Vision Banana paper shows that training a model to generate images, and only that, produces transferable visual representations strong enough to beat specialized discriminative models on segmentation and metric depth estimation once the model is lightly instruction-tuned. The finding is the visual analog of how LLM pretraining generalizes across language tasks.

Read more →

Microsoft Starts Building Its Own

Microsoft released three foundational AI models through Azure AI Foundry on April 2: MAI-Transcribe-1 for speech, MAI-Voice-1 for synthesis, and MAI-Image-2 for generation. These are Microsoft's first internally built foundational models — a quiet but significant signal that the company wants more control over its AI stack than the OpenAI partnership alone provides.

Read more →