Moe · AI Beat

17 Jul 2026 · AI Beat Desk

Two Point Eight Trillion

Moonshot AI announced Kimi K3 on July 16, claiming "the world's first open 3T-class model" at 2.8 trillion total parameters — with weights delayed until July 27. The architecture uses a 16-of-896 expert MoE with Kimi Delta Attention and MXFP4 quantization-aware training, keeping active inference cost near a 50B model while scaling total capacity nearly three-fold over K2.

16 Jul 2026 · AI Beat Desk

Thinking Machines Ships Inkling

Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, released its first public model on July 15: Inkling, a 975B total / 41B active mixture-of-experts trained on 45 trillion multimodal tokens, Apache 2.0 licensed, with AIME 2026 97.1% and SWEBench Verified 77.6%. The lab's explicit framing is "not the best, but the most customizable" — a positioning bet that the open-weights market rewards fine-tuning infrastructure over raw benchmark supremacy.

10 Jul 2026 · AI Beat Desk

Tencent's Hy3: Apache-Licensed and Punching Above Its Weight

Tencent released Hy3 on July 6 under Apache 2.0 — a 295B MoE model with 21B active parameters that scores 90.4 on GPQA Diamond and 78.0 on SWE-Bench Verified, matching or exceeding models two to five times its active-parameter count. It's available for free on OpenRouter through July 21 and on Hugging Face in both full FP16 and FP8 quantized forms.

10 Jul 2026 · AI Beat Desk

Streaming 744 Billion Parameters from Disk

Colibri, a ~1300-line pure-C engine posted on Hacker News overnight, runs the 744B GLM-5.2 MoE on a 25GB-RAM consumer machine by streaming routed experts from NVMe on demand. It's not fast, but it works — and the architectural insight it exploits (most of a MoE's parameters are cold at any given token) points to a design pattern that will matter more as open-weight frontier models keep growing.

05 Jul 2026 · AI Beat Desk

The Model That Passed as Anonymous

Meituan's LongCat-2.0 — a 1.6T-parameter open-weight MoE trained entirely on domestic Chinese ASICs — spent two months deployed anonymously on OpenRouter as "Owl Alpha," quietly reaching #1 on Hermes Agent and #2 on Claude Code before the company claimed it. The reveal is technically notable, but the verification gaps are worth keeping in view.

30 Jun 2026 · AI Beat Desk

Ornith-1.0: The RL Loop Learns Its Own Harness

DeepReinforce released Ornith-1.0 on June 25 — four MIT-licensed coding models (9B to 397B) trained with a self-scaffolding RL approach that jointly optimizes the tool-use loop and the solution code rather than fixing the scaffold as a human-designed constant. The 397B variant beats Claude Opus 4.7 on SWE-Bench Verified and Terminal-Bench 2.1; the 35B MoE beats Qwen 3.5-397B on Terminal-Bench at one-eleventh the parameter count.

30 Jun 2026 · AI Beat Desk

Meituan's Trillion-Parameter Model and the Chip Independence Question

Meituan open-sourced LongCat-2.0 today — a 1.6-trillion-parameter MoE with a 1M-token context window trained entirely on domestic Huawei Ascend ASICs. It is the first plausible demonstration that frontier-scale pre-training is achievable without NVIDIA hardware, arriving on the same week that US export restrictions on Anthropic's top models remained in partial force.

18 Jun 2026 · AI Beat Desk

GLM-5.2: Open Weights, Confirmed Benchmarks

Z.ai shipped the MIT weights for GLM-5.2 on June 17 — 753B MoE, 40B active, 1M context — and the benchmarks back up the release: 74.4% on FrontierSWE, 81% on Terminal-Bench 2.1, and top of the Artificial Analysis open-weights leaderboard. The catch is token consumption nearly double its nearest open-weights competitors.

14 Jun 2026 · AI Beat Desk

GLM 5.2 Ships Access Before Evidence

Z.ai shipped GLM 5.2 to every Coding Plan subscriber on June 13 with a 1-million-token context and zero published benchmarks. Open weights arrive "next week." The inversion — distribution first, proof second — is becoming a deliberate strategy in the crowded coding-model space.

13 Jun 2026 · AI Beat Desk

Kimi Trims the Reasoning

Moonshot AI's Kimi K2.7-Code is a 1-trillion-parameter MoE coding model that improves on its predecessor while using 30% fewer reasoning tokens. The reasoning-token efficiency story is the interesting part: the model has been explicitly tuned to stop overthinking, and the benchmarks suggest it works.

09 Jun 2026 · AI Beat Desk

A Trillion Parameters at a Thousand Tokens Per Second

Xiaomi and TileRT published MiMo-V2.5-Pro-UltraSpeed on June 8, pushing a one-trillion-parameter model past 1000 tokens per second on a single standard 8-GPU node — no custom silicon, just three carefully chosen co-design decisions applied to a commodity cluster.

03 Jun 2026 · AI Beat Desk

Microsoft Stops Outsourcing Intelligence

Microsoft shipped two frontier models at Build 2026 — MAI-Thinking-1 and MAI-Code-1-Flash — built entirely without OpenAI data or distillation. The technical choices are interesting; the strategic signal is clearer: Microsoft is no longer content to be a reseller.

29 May 2026 · AI Beat Desk

The Ghost at the Top of the Rankings

Tencent's Hy3 preview — a 295B MoE model with 21B active parameters, open-sourced under a community license — has quietly risen to the top of OpenRouter's usage rankings, outpacing Claude by over 50%. Almost nobody in Western ML circles has written about it. Max Woolf's investigation reveals a usage pattern that makes the mystery deeper: 98% input tokens, available only through SiliconFlow, and less than 1% of traffic from known apps — suggesting a single large unnamed pipeline is driving the entire ranking.

08 May 2026 · AI Beat Desk

One Model, One Chip, No Framework

Salvatore Sanfilippo (antirez, Redis) released ds4: a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization on MoE experts only gets a 280B-parameter model into 128 GB RAM with 26–36 t/s generation, 1M-token context, and disk-persisted KV cache on Apple Silicon.

24 Apr 2026 · AI Beat Desk

Dense Beats Sparse, and Thinking Persists

A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy, hitting 77.2% on SWE-bench Verified and adding thinking preservation — a mechanism to keep chain-of-thought traces across multi-turn agent conversations.

17 Apr 2026 · AI Beat Desk

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0 — 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4 on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

02 Apr 2026 · AI Beat Desk

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.