The Ghost at the Top of the Rankings

Tencent's Hy3 preview, a 295B MoE model with 21B active parameters open-sourced under a community license, has quietly risen to the top of OpenRouter's usage rankings, outpacing Claude by over 50%. Almost nobody in Western ML circles has written about it. Max Woolf's investigation reveals a usage pattern that deepens the mystery: 98% of its tokens are input tokens, it is available only through SiliconFlow, and less than 1% of its traffic comes from known apps, suggesting that a single large unnamed pipeline is driving the entire ranking.

Read more →

Dropping the Encoder

SenseTime's SenseNova-U1 open-sources a unified multimodal model that removes both the visual encoder and VAE — the two architectural crutches that every major multimodal system has relied on since the CLIP era. The NEO-unify architecture processes pixels natively through a shared transformer backbone, with a direct pixel-space MLP head for generation. Benchmarks on image generation and interleaved content put it at or above current open-source leaders, with the spatial reasoning numbers being the most credible differentiator.

Read more →

Needle: What a 26M-Parameter Model Says About Tool Calling

Cactus Compute released Needle, a 26M-parameter MIT-licensed model for on-device function calling that strips out all feed-forward networks from the transformer. The architectural choice is a thesis: tool calling is retrieval-and-routing, not reasoning, and attention is the right primitive for it. The numbers are striking (6,000 tok/s prefill on consumer hardware), even if the playground has rough edges.

Read more →

Gemma 4 Gets Speculative Decoding That Ships

Google ships multi-token prediction draft models for the full Gemma 4 family under Apache 2.0, reporting up to 3x throughput gains. The architecture is tightly coupled — shared embeddings, last-layer activations — which keeps the drafter accurate but limits reuse. MoE variants complicate the picture.
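For readers new to the idea, here is a minimal sketch of the draft-and-verify loop behind speculative decoding, using greedy verification. It is generic, not Gemma 4's drafter: `target` and `drafter` are hypothetical toy functions standing in for the large and small models, and nothing here models the shared embeddings or last-layer activations mentioned above.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def target(prefix: List[int]) -> int:
    """Toy 'large' model: next token is (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10


def drafter(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target except right after a 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: NextToken, verify: NextToken, k: int) -> List[int]:
    """Draft k tokens, keep the run the target agrees with, then add one target token."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    accepted: List[int] = []
    for tok in proposed:
        # A real system scores all k positions in one batched forward pass;
        # this toy calls the target once per position for clarity.
        expected = verify(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(verify(prefix + accepted))  # every draft accepted: one bonus token
    return accepted


def generate(prompt: List[int], n: int, k: int = 3) -> List[int]:
    """Generate n tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out += speculative_step(out, drafter, target, k)
    return out[: len(prompt) + n]


print(generate([0], 8))
```

With greedy verification the output matches what the target would have produced on its own; the gain comes from accepting several drafted tokens per target pass when the drafter is usually right.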

Read more →

IBM's Quality Bet: 8B Dense Beats the 32B MoE

IBM's Granite 4.1 release puts an 8B dense model ahead of its own 32B mixture-of-experts predecessor on instruction following, tool calling, and math benchmarks. The result comes from a five-phase training pipeline that treats data quality as the primary lever, an LLM-as-Judge filter that screens all fine-tuning samples across six dimensions, and a four-stage RL curriculum with a dedicated recovery phase after RLHF degraded math performance.

Read more →

The Model That Stopped at 1930

Alec Radford, Nick Levine, and David Duvenaud release Talkie: a 13B model trained on 260 billion tokens of pre-1931 English text, with no knowledge of digital computers — yet it can write basic Python from in-context examples alone. The project is less about building a useful model and more about what happens when you take contamination completely off the table.

Read more →

Dense Beats Sparse, and Thinking Persists

A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy. It hits 77.2% on SWE-bench Verified and adds thinking preservation, a mechanism to keep chain-of-thought traces across multi-turn agent conversations.

Read more →

Claude 4.7's Quiet Migration Tax

Claude Opus 4.7 shipped April 16 with an unchanged sticker price, but the real migration cost is higher than the headline suggests: a new tokenizer quietly inflates token counts by 20–35% on code and technical text, and requests that set any of three commonly used sampling parameters (temperature, top_p, top_k) now fail with a 400 error instead of having those parameters silently ignored.
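A rough way to budget for both changes, as a sketch. The 20–35% inflation range and the three parameter names come from the summary above; the token volume and price are placeholder assumptions, and `strip_sampling_params` and `inflated_cost` are hypothetical helpers, not part of any SDK.

```python
# Sampling parameters that, per the summary above, now trigger a 400.
REJECTED_PARAMS = ("temperature", "top_p", "top_k")


def strip_sampling_params(request: dict) -> dict:
    """Return a copy of a request payload without the rejected parameters."""
    return {k: v for k, v in request.items() if k not in REJECTED_PARAMS}


def inflated_cost(old_tokens: int, price_per_mtok: float, inflation: float) -> float:
    """Cost after the tokenizer change for text that used to be `old_tokens` long."""
    return old_tokens * (1 + inflation) * price_per_mtok / 1_000_000


# Placeholder numbers: 10M tokens of code at a hypothetical $5 per million tokens.
baseline = inflated_cost(10_000_000, 5.0, 0.0)
low = inflated_cost(10_000_000, 5.0, 0.20)
high = inflated_cost(10_000_000, 5.0, 0.35)
print(f"baseline ${baseline:.2f}, after migration ${low:.2f} to ${high:.2f}")

# Hypothetical payload; only the three listed keys are removed.
print(strip_sampling_params({"model": "example-model", "temperature": 0.2, "max_tokens": 256}))
```

The same sticker price applied to 20–35% more tokens is a 20–35% higher bill for the same text, which is the point of the headline.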

Read more →

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0: 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4% on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

Read more →

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.
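To make "256 experts with 4 active per token" concrete, here is a minimal top-k routing sketch. Only the two counts come from the summary above; softmax over the selected experts is a common convention assumed here, the router logits are random placeholders, and Trinity's actual router may differ.

```python
import math
import random

NUM_EXPERTS = 256  # from the summary above
TOP_K = 4          # experts active per token, from the summary above


def route(router_logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # placeholder router scores
chosen = route(logits)
print(chosen)
print(f"share of experts used per token: {TOP_K / NUM_EXPERTS:.2%}")
```

Each token touches only 4 of 256 expert blocks, which is why a sparse model's per-token compute is far below what its total parameter count suggests.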

Read more →

One Bit All the Way Down

PrismML launched Bonsai on March 31, claiming the first commercially viable true 1-bit LLMs: an 8B model that fits in 1.15 GB and runs at 131 tokens/sec on an M4 Pro. The key word is "true" — every layer, including embeddings and attention, is 1-bit, not just the weights in isolation.
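A quick sanity check on the size claim, as back-of-the-envelope arithmetic. It assumes "8B" means exactly 8 billion parameters and "1.15 GB" means decimal gigabytes; neither is confirmed by the summary above.

```python
PARAMS = 8e9         # "8B" taken as exactly 8 billion parameters (assumption)
SIZE_BYTES = 1.15e9  # "1.15 GB" taken as decimal gigabytes (assumption)

bits_per_param = SIZE_BYTES * 8 / PARAMS
fp16_bytes = PARAMS * 2  # the same parameter count at 16 bits each, for comparison

print(f"effective bits per parameter: {bits_per_param:.2f}")
print(f"FP16 size: {fp16_bytes / 1e9:.0f} GB, ratio: {fp16_bytes / SIZE_BYTES:.1f}x")
```

Under those assumptions the file works out to slightly more than one bit per parameter, so the figure is at least consistent with a 1-bit format plus a small amount of extra stored data; the summary does not break down what that extra is.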

Read more →

Microsoft's Harrier Embeds 32K Tokens at Once

Microsoft released Harrier-OSS-v1, a family of decoder-only multilingual embedding models (270M, 0.6B, 27B) with a 32,768-token context window, 32–64x longer than the 512–1,024-token ceiling most practitioners hit today. The 27B model takes SOTA on Multilingual MTEB v2 at 74.3; all three variants are MIT licensed.

Read more →