Models · AI Beat

17 Jul 2026 · AI Beat Desk

Two Point Eight Trillion

Moonshot AI announced Kimi K3 on July 16, claiming "the world's first open 3T-class model" at 2.8 trillion total parameters — with weights delayed until July 27. The architecture uses a 16-of-896 expert MoE with Kimi Delta Attention and MXFP4 quantization-aware training, keeping active inference cost near a 50B model while scaling total capacity nearly three-fold over K2.
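
The "near a 50B model" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes, as a simplification not stated in the announcement, that every parameter sits in a routed expert; real MoE models also carry shared attention and embedding weights, so treat the result as a rough estimate only.

```python
def active_params(total_params: float, experts_total: int, experts_active: int) -> float:
    """Estimate parameters used per token.

    Simplifying assumption: all parameters live in routed experts
    (shared layers and embeddings are ignored).
    """
    return total_params * experts_active / experts_total


total = 2.8e12  # 2.8 trillion total parameters, from the announcement
active = active_params(total, experts_total=896, experts_active=16)

print(f"expert fraction active per token: {16 / 896:.2%}")
print(f"estimated active parameters: {active / 1e9:.0f}B")
```

Under that assumption, 16 of 896 experts is about 1.8% of the model, which lands on roughly 50B active parameters and matches the claim in the announcement.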

16 Jul 2026 · AI Beat Desk

Thinking Machines Ships Inkling

Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, released its first public model on July 15: Inkling, a 975B total / 41B active mixture-of-experts trained on 45 trillion multimodal tokens and licensed under Apache 2.0, scoring 97.1% on AIME 2026 and 77.6% on SWE-bench Verified. The lab's explicit framing is "not the best, but the most customizable," a positioning bet that the open-weights market rewards fine-tuning infrastructure over raw benchmark supremacy.

10 Jul 2026 · AI Beat Desk

Tencent's Hy3: Apache-Licensed and Punching Above Its Weight

Tencent released Hy3 on July 6 under Apache 2.0 — a 295B MoE model with 21B active parameters that scores 90.4 on GPQA Diamond and 78.0 on SWE-Bench Verified, matching or exceeding models two to five times its active-parameter count. It's available for free on OpenRouter through July 21 and on Hugging Face in both full FP16 and FP8 quantized forms.

05 Jul 2026 · AI Beat Desk

The Model That Passed as Anonymous

Meituan's LongCat-2.0 — a 1.6T-parameter open-weight MoE trained entirely on domestic Chinese ASICs — spent two months deployed anonymously on OpenRouter as "Owl Alpha," quietly reaching #1 on Hermes Agent and #2 on Claude Code before the company claimed it. The reveal is technically notable, but the verification gaps are worth keeping in view.

02 Jul 2026 · AI Beat Desk

Open Weight, Mainstream Channel

Kimi K2.7 Code became the first open-weight model selectable in GitHub Copilot's model picker on July 1. Moonshot AI's 1-trillion-parameter MoE joins Claude and Gemini in GitHub's hosted offering — but unlike those, its weights are public. The move is less about this specific model and more about what it signals: the line between open-weight and enterprise product is getting thinner.

30 Jun 2026 · AI Beat Desk

Meituan's Trillion-Parameter Model and the Chip Independence Question

Meituan open-sourced LongCat-2.0 today: a 1.6-trillion-parameter MoE with a 1M-token context window, trained entirely on domestic Huawei Ascend ASICs. It is the first plausible demonstration that frontier-scale pre-training is achievable without NVIDIA hardware, arriving in the same week that US export restrictions on Anthropic's top models remained in partial force.

27 Jun 2026 · AI Beat Desk

The Benchmark You Pick Is the Argument You're Making

A Doubleword analysis circulating on Hacker News today illustrates something worth internalizing: depending on which benchmark you select, you can convincingly argue that open-source models will reach frontier parity in December 2026, or that the gap has barely moved in two years. Both numbers come from real data. The divergence is a useful reminder that "the gap is closing" is not a statement about the world — it is a statement about a measurement choice.

23 Jun 2026 · AI Beat Desk

The Inpainting Model That Skipped the Attention

HUST's Moebius (0.22B) matches FLUX.1-Fill-Dev (11.9B) on six image inpainting benchmarks at 15× the inference speed. Two mechanisms make it work: Local-λ Mix Interaction blocks that replace quadratic spatial attention with fixed-size linear matrices, and adaptive multi-granularity latent-space distillation. For inpainting specifically, attention overhead appears to be the actual bottleneck — not parameter count. Weights are out.

18 Jun 2026 · AI Beat Desk

GLM-5.2: Open Weights, Confirmed Benchmarks

Z.ai shipped the MIT weights for GLM-5.2 on June 17 — 753B MoE, 40B active, 1M context — and the benchmarks back up the release: 74.4% on FrontierSWE, 81% on Terminal-Bench 2.1, and top of the Artificial Analysis open-weights leaderboard. The catch is token consumption nearly double its nearest open-weights competitors.

17 Jun 2026 · AI Beat Desk

Alibaba Splits the Robot Brain in Three

Alibaba's Qwen-Robot Suite breaks the physical AI problem into three specialized models — navigation, manipulation, and world prediction — sharing a common foundation but targeting different action spaces. The interesting architectural decision is the canonical state-action representation that lets all three train on heterogeneous robot data without task-specific pipelines.

14 Jun 2026 · AI Beat Desk

GLM 5.2 Ships Access Before Evidence

Z.ai shipped GLM 5.2 to every Coding Plan subscriber on June 13 with a 1-million-token context and zero published benchmarks. Open weights arrive "next week." The inversion — distribution first, proof second — is becoming a deliberate strategy in the crowded coding-model space.

13 Jun 2026 · AI Beat Desk

Kimi Trims the Reasoning

Moonshot AI's Kimi K2.7-Code is a 1-trillion-parameter MoE coding model that improves on its predecessor while using 30% fewer reasoning tokens. The reasoning-token efficiency story is the interesting part: the model has been explicitly tuned to stop overthinking, and the benchmarks suggest it works.

12 Jun 2026 · AI Beat Desk

Text Diffusion Reaches Consumer Hardware

Google's DiffusionGemma 26B-A4B is a discrete text diffusion model that generates tokens in parallel blocks rather than left-to-right, hitting 1100+ tokens/sec on a single H100 and fitting in 18 GB of VRAM quantized. It's open under Apache 2.0 and marks the first time a production-quality diffusion LM from a major lab lands on consumer hardware — with real benchmark results showing what you trade away for that speed.

07 Jun 2026 · AI Beat Desk

When One Model Reasons and Simulates

NVIDIA's Cosmos 3 bets on collapsing the physical AI model stack — VLM understanding, video world simulation, and robot action generation — into a single Mixture-of-Transformers architecture where reasoning and diffusion paths share joint attention. The key question is whether that coupling actually beats specialist models, or whether this is mainly a convenience story.

05 Jun 2026 · AI Beat Desk

Magenta RealTime 2 Is Actually an Instrument Now

Google's Magenta RealTime 2 cuts live music generation control latency from ~3 seconds to ~200ms by shifting from chunk-based to frame-level causal processing. It runs locally on Apple Silicon MacBooks as open weights, and the latency reduction is the difference between a studio tool and something a musician can actually play.

04 Jun 2026 · AI Beat Desk

Gemma 4 12B Goes Encoder-Free

Google DeepMind's Gemma 4 12B discards the conventional encoder-stack approach to multimodal models, feeding raw pixel patches and audio waveforms directly into the LLM backbone through lightweight linear projections. The result fits in 16 GB of RAM, accepts native audio, and fine-tunes as a single unified model.

02 Jun 2026 · AI Beat Desk

MiniMax M3 and the Cost of Long Context

MiniMax M3 launches with a sparse attention mechanism that cuts per-token compute at 1M tokens to one-twentieth of its predecessor. The architecture is genuinely interesting; the benchmarks require scrutiny; the license is almost certainly not what the word "open-weight" implies.

01 Jun 2026 · AI Beat Desk

Image Generation at 1 Bit

PrismML's Bonsai Image 4B applies 1-bit and ternary quantization to a FLUX.2 Klein diffusion transformer, compressing it 8.3× to 0.93 GB — small enough to generate images on an iPhone in under 10 seconds. It's the first demonstration that extreme quantization techniques developed for language models transfer cleanly to diffusion architectures.

30 May 2026 · AI Beat Desk

Liquid AI's LFM2.5: When Half Your Layers Aren't Attention

Liquid AI ships LFM2.5-8B-A1B, a hybrid model trained on 38T tokens in which 18 of 24 layers are gated convolution blocks rather than attention. It reaches 253 tokens/second on an M5 Max CPU with under 6 GB of memory.

29 May 2026 · AI Beat Desk

The Ghost at the Top of the Rankings

Tencent's Hy3 preview — a 295B MoE model with 21B active parameters, open-sourced under a community license — has quietly risen to the top of OpenRouter's usage rankings, outpacing Claude by over 50%. Almost nobody in Western ML circles has written about it. Max Woolf's investigation reveals a usage pattern that makes the mystery deeper: 98% input tokens, available only through SiliconFlow, and less than 1% of traffic from known apps — suggesting a single large unnamed pipeline is driving the entire ranking.

14 May 2026 · AI Beat Desk

Dropping the Encoder

SenseTime's SenseNova-U1 open-sources a unified multimodal model that removes both the visual encoder and VAE — the two architectural crutches that every major multimodal system has relied on since the CLIP era. The NEO-unify architecture processes pixels natively through a shared transformer backbone, with a direct pixel-space MLP head for generation. Benchmarks on image generation and interleaved content put it at or above current open-source leaders, with the spatial reasoning numbers being the most credible differentiator.

13 May 2026 · AI Beat Desk

Needle: What a 26M-Parameter Model Says About Tool Calling

Cactus Compute released Needle, a 26M-parameter MIT-licensed model for on-device function calling that strips out all feed-forward networks from the transformer. The architectural choice is a thesis: tool calling is retrieval-and-routing, not reasoning, and attention is the right primitive for it. The numbers are striking — 6000 tok/s prefill on consumer hardware — even if the playground has rough edges.

06 May 2026 · AI Beat Desk

Gemma 4 Gets Speculative Decoding That Ships

Google ships multi-token prediction draft models for the full Gemma 4 family under Apache 2.0, reporting up to 3x throughput gains. The architecture is tightly coupled — shared embeddings, last-layer activations — which keeps the drafter accurate but limits reuse. MoE variants complicate the picture.
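
For readers new to the idea, here is a minimal sketch of one speculative decoding step with greedy verification. It is a toy, not Google's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the full model and the drafter, and a real system verifies all drafted tokens in a single batched forward pass rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(tok)

    accepted.append(target(context + accepted))  # bonus token when every draft matched
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, context=[0], k=4))
```

The output always equals what the target would have generated on its own; the speedup comes from accepting several drafted tokens per target pass, which is why drafter accuracy matters so much.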

01 May 2026 · AI Beat Desk

IBM's Quality Bet: 8B Dense Beats the 32B MoE

IBM's Granite 4.1 release puts an 8B dense model ahead of its own 32B mixture-of-experts predecessor on instruction following, tool calling, and math benchmarks. The result comes from a five-phase training pipeline that treats data quality as the primary lever, an LLM-as-Judge filter that screens all fine-tuning samples across six dimensions, and a four-stage RL curriculum with a dedicated recovery phase after RLHF degraded math performance.

28 Apr 2026 · AI Beat Desk

The Model That Stopped at 1930

Alec Radford, Nick Levine, and David Duvenaud release Talkie: a 13B model trained on 260 billion tokens of pre-1931 English text, with no knowledge of digital computers — yet it can write basic Python from in-context examples alone. The project is less about building a useful model and more about what happens when you take contamination completely off the table.

24 Apr 2026 · AI Beat Desk

Dense Beats Sparse, and Thinking Persists

A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy, hitting 77.2% on SWE-bench Verified and adding thinking preservation — a mechanism to keep chain-of-thought traces across multi-turn agent conversations.

18 Apr 2026 · AI Beat Desk

Claude 4.7's Quiet Migration Tax

Claude Opus 4.7 shipped April 16 with an unchanged sticker price, but the real migration cost is higher than the headline suggests: a new tokenizer quietly inflates token counts by 20–35% on code and technical text, and three commonly used sampling parameters (temperature, top_p, top_k) now return a 400 error instead of being silently ignored.
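
A rough sketch of what the two changes mean for a migration script. Everything here is illustrative: the request is a plain dictionary rather than a call to any real SDK, the model name is a placeholder, and the $10 per million tokens price is a hypothetical figure chosen only to make the arithmetic visible.

```python
REJECTED_PARAMS = ("temperature", "top_p", "top_k")  # the three parameters named above


def strip_sampling_params(request: dict) -> dict:
    """Return a copy of the request without the parameters that now trigger a 400."""
    return {key: value for key, value in request.items() if key not in REJECTED_PARAMS}


def inflated_cost(old_tokens: int, price_per_million: float, inflation: float) -> float:
    """Cost of the same text after the tokenizer inflates its token count."""
    return old_tokens * (1 + inflation) * price_per_million / 1_000_000


# Hypothetical request payload; "example-model" is a placeholder, not a real model ID.
request = {"model": "example-model", "max_tokens": 1024, "temperature": 0.2, "top_p": 0.9}
print(strip_sampling_params(request))

# Hypothetical price of $10 per million tokens, applied to 1M old-tokenizer tokens.
low = inflated_cost(1_000_000, 10.0, 0.20)
high = inflated_cost(1_000_000, 10.0, 0.35)
print(f"same workload: ${low:.2f} to ${high:.2f} instead of $10.00")
```

Because the per-token price is unchanged, the bill scales directly with the token count: a 20–35% inflation in tokens is a 20–35% increase in spend for the same text.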

17 Apr 2026 · AI Beat Desk

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0 — 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4 on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

02 Apr 2026 · AI Beat Desk

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.
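
To make "256 experts with 4 active per token" concrete, here is a generic top-k routing sketch. It is a common textbook formulation, assumed for illustration and not confirmed as Arcee's router: random numbers stand in for learned router logits, and the softmax is taken over the selected experts only.

```python
import math
import random
from typing import Dict, List


def route_top_k(logits: List[float], k: int) -> Dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: value / norm for i, value in exps.items()}


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for a learned router
weights = route_top_k(router_logits, k=4)

for expert_id, weight in weights.items():
    print(f"expert {expert_id}: mixing weight {weight:.3f}")
```

Only the four selected experts run for that token, and their outputs are combined using the mixing weights; the other 252 experts cost nothing at inference time beyond the memory they occupy.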

01 Apr 2026 · AI Beat Desk

One Bit All the Way Down

PrismML launched Bonsai on March 31, claiming the first commercially viable true 1-bit LLMs: an 8B model that fits in 1.15 GB and runs at 131 tokens/sec on an M4 Pro. The key word is "true" — every layer, including embeddings and attention, is 1-bit, not just the weights in isolation.
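
A quick check of how close 1.15 GB is to a literal one bit per parameter. The sketch assumes decimal gigabytes and exactly 8 billion parameters, neither of which the announcement specifies.

```python
def bits_per_param(size_gb: float, params: float) -> float:
    """Average storage per parameter, assuming decimal gigabytes (1 GB = 1e9 bytes)."""
    return size_gb * 1e9 * 8 / params


params = 8e9  # assumption: exactly 8 billion parameters
print(f"{bits_per_param(1.15, params):.2f} bits per parameter")

fp16_gb = params * 2 / 1e9  # 16-bit baseline: two bytes per parameter
print(f"16-bit baseline: {fp16_gb:.0f} GB, about {fp16_gb / 1.15:.1f}x larger")
```

Under those assumptions the file averages a little over one bit per parameter, which is consistent with 1-bit storage plus a small amount of overhead.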

31 Mar 2026 · AI Beat Desk

Microsoft's Harrier Embeds 32K Tokens at Once

Microsoft released Harrier-OSS-v1, a family of decoder-only multilingual embedding models (270M, 0.6B, 27B) with a 32,768-token context window, 32–64x longer than the 512–1,024 token ceiling most practitioners hit today. The 27B model takes SOTA on Multilingual MTEB v2 at 74.3; all three variants are MIT licensed.

23 Mar 2026 · AI Beat Desk

397 Billion Parameters, One Laptop

Flash-MoE shows how SSD-streamed experts let a 397B-parameter MoE run locally on consumer Apple Silicon hardware.
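
One way to picture SSD-streamed experts is a small in-memory cache that loads an expert from disk only when the router asks for it. The sketch below uses least-recently-used eviction, which is an assumption made for illustration and not a description of Flash-MoE's actual policy; `fake_load` is a hypothetical stand-in for reading expert weights from an SSD.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keep at most `capacity` experts in memory, evicting the least recently used."""

    def __init__(self, load: Callable[[int], bytes], capacity: int) -> None:
        self._load = load
        self._capacity = capacity
        self._resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0  # number of times we had to go to disk

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            return self._resident[expert_id]
        self.misses += 1
        weights = self._load(expert_id)
        self._resident[expert_id] = weights
        if len(self._resident) > self._capacity:
            self._resident.popitem(last=False)  # drop the least recently used expert
        return weights


def fake_load(expert_id: int) -> bytes:
    """Hypothetical loader: a real one would read the expert's weights from SSD."""
    return bytes([expert_id])


cache = ExpertCache(fake_load, capacity=2)
for expert_id in [1, 2, 1, 3, 2]:
    cache.get(expert_id)
print(f"disk reads: {cache.misses} of 5 lookups")
```

The design trade-off is between memory and disk traffic: a larger cache means fewer SSD reads per token, and routing patterns that reuse the same experts benefit the most.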