The Ghost at the Top of the Rankings

Tencent's Hy3 preview, a 295B MoE model with 21B active parameters that is open-sourced under a community license, has quietly risen to the top of OpenRouter's usage rankings, outpacing Claude by over 50%. Almost nobody in Western ML circles has written about it. Max Woolf's investigation reveals a usage pattern that deepens the mystery: 98% of the tokens are input tokens, the model is available only through SiliconFlow, and less than 1% of traffic comes from known apps, which suggests that a single large unnamed pipeline is driving the entire ranking.

One Model, One Chip, No Framework

Salvatore Sanfilippo (antirez, Redis) released ds4, a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization, applied to the MoE experts only, fits a 280B-parameter model into 128 GB of RAM on Apple Silicon, with 26–36 t/s generation, a 1M-token context, and a disk-persisted KV cache.
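
As a rough sanity check on the memory claim, here is a back-of-the-envelope sketch. The 280B total and the 2-bit expert width come from the blurb above; the share of parameters that sits in the experts and the 8-bit width for everything else are assumptions chosen for illustration, not figures from ds4, and the per-group scale and offset that asymmetric quantization stores are ignored.

```python
def quantized_size_gb(params, bits_per_weight):
    """Storage in GB (10^9 bytes) for `params` weights at `bits_per_weight` bits each."""
    return params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 280e9     # from the blurb
EXPERT_BITS = 2          # from the blurb: 2-bit quantization on the experts only
EXPERT_FRACTION = 0.95   # ASSUMPTION: share of parameters inside MoE experts
OTHER_BITS = 8           # ASSUMPTION: width kept for attention, embeddings, etc.

expert_gb = quantized_size_gb(TOTAL_PARAMS * EXPERT_FRACTION, EXPERT_BITS)
other_gb = quantized_size_gb(TOTAL_PARAMS * (1 - EXPERT_FRACTION), OTHER_BITS)
total_gb = expert_gb + other_gb

print(f"experts: {expert_gb:.1f} GB, other: {other_gb:.1f} GB, total: {total_gb:.1f} GB")
```

Under these assumptions the weights come to roughly 80 GB, which would leave headroom in 128 GB for the KV cache and the operating system. Different assumptions about the expert share shift the total, but the expert weights dominate either way.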

Dense Beats Sparse, and Thinking Persists

A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy, hitting 77.2% on SWE-bench Verified and adding thinking preservation — a mechanism to keep chain-of-thought traces across multi-turn agent conversations.

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0: 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4% on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.
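
To make "256 experts with 4 active per token" concrete, here is a minimal sketch of generic top-k softmax routing. It is a textbook formulation, not necessarily the router Trinity uses; the function name `route`, the random logits, and the choice to renormalize over the selected experts are all assumptions for illustration.

```python
import math
import random

NUM_EXPERTS = 256       # from the blurb
ACTIVE_PER_TOKEN = 4    # from the blurb

def route(router_logits, k=ACTIVE_PER_TOKEN):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)            # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router output for one token: one logit per expert.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)
active_fraction = ACTIVE_PER_TOKEN / NUM_EXPERTS
print(chosen)
print(f"experts used per token: {active_fraction:.2%}")
```

Only the selected experts run for a given token, so each token touches 4 of the 256 expert blocks in a layer. That sparsity is where the gap between total and active parameter counts comes from.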
