Open-Weights · AI Beat

16 Jul 2026 · AI Beat Desk

Thinking Machines Ships Inkling

Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, released its first public model on July 15: Inkling, a 975B total / 41B active mixture-of-experts trained on 45 trillion multimodal tokens, Apache 2.0 licensed, with AIME 2026 97.1% and SWEBench Verified 77.6%. The lab's explicit framing is "not the best, but the most customizable" — a positioning bet that the open-weights market rewards fine-tuning infrastructure over raw benchmark supremacy.

27 Jun 2026 · AI Beat Desk

The Benchmark You Pick Is the Argument You're Making

A Doubleword analysis circulating on Hacker News today illustrates something worth internalizing: depending on which benchmark you select, you can convincingly argue that open-source models will reach frontier parity in December 2026, or that the gap has barely moved in two years. Both numbers come from real data. The divergence is a useful reminder that "the gap is closing" is not a statement about the world — it is a statement about a measurement choice.

09 Jun 2026 · AI Beat Desk

A Trillion Parameters at a Thousand Tokens Per Second

Xiaomi and TileRT published MiMo-V2.5-Pro-UltraSpeed on June 8, pushing a one-trillion-parameter model past 1000 tokens per second on a single standard 8-GPU node — no custom silicon, just three carefully chosen co-design decisions applied to a commodity cluster.

07 Jun 2026 · AI Beat Desk

When One Model Reasons and Simulates

NVIDIA's Cosmos 3 bets on collapsing the physical AI model stack — VLM understanding, video world simulation, and robot action generation — into a single Mixture-of-Transformers architecture where reasoning and diffusion paths share joint attention. The key question is whether that coupling actually beats specialist models, or whether this is mainly a convenience story.

05 Jun 2026 · AI Beat Desk

Magenta RealTime 2 Is Actually an Instrument Now

Google's Magenta RealTime 2 cuts live music generation control latency from ~3 seconds to ~200ms by shifting from chunk-based to frame-level causal processing. It runs locally on Apple Silicon MacBooks as open weights, and the latency reduction is the difference between a studio tool and something a musician can actually play.

04 Jun 2026 · AI Beat Desk

Gemma 4 12B Goes Encoder-Free

Google DeepMind's Gemma 4 12B discards the conventional encoder-stack approach to multimodal models, feeding raw pixel patches and audio waveforms directly into the LLM backbone through lightweight linear projections. The result fits in 16 GB of RAM, accepts native audio, and fine-tunes as a single unified model.

02 Jun 2026 · AI Beat Desk

MiniMax M3 and the Cost of Long Context

MiniMax M3 launches with a sparse attention mechanism that cuts per-token compute at 1M tokens to one-twentieth of its predecessor. The architecture is genuinely interesting; the benchmarks require scrutiny; the license is almost certainly not what the word "open-weight" implies.

24 Apr 2026 · AI Beat Desk

Dense Beats Sparse, and Thinking Persists

A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy, hitting 77.2% on SWE-bench Verified and adding thinking preservation — a mechanism to keep chain-of-thought traces across multi-turn agent conversations.

17 Apr 2026 · AI Beat Desk

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0 — 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4 on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

02 Apr 2026 · AI Beat Desk

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.

01 Apr 2026 · AI Beat Desk

One Bit All the Way Down

PrismML launched Bonsai on March 31, claiming the first commercially viable true 1-bit LLMs: an 8B model that fits in 1.15 GB and runs at 131 tokens/sec on an M4 Pro. The key word is "true" — every layer, including embeddings and attention, is 1-bit, not just the weights in isolation.

31 Mar 2026 · AI Beat Desk

Microsoft's Harrier Embeds 32K Tokens at Once

Microsoft released Harrier-OSS-v1, a family of decoder-only multilingual embedding models (270M, 0.6B, 27B) with a 32,768-token context window — roughly 30–60x longer than the 512–1,024 token ceiling most practitioners hit today. The 27B model takes SOTA on Multilingual MTEB v2 at 74.3; all three variants are MIT licensed.

31 Mar 2026 · AI Beat Desk

What You Get When You Only Train on Public Domain Text

Mr. Chatterbox is a 340M-parameter model trained exclusively on 28,000 Victorian-era texts from the British Library — definitively public domain, zero copyright exposure. Simon Willison's writeup documents both what it proves and what it falls short of: the corpus is large enough to train something coherent, but not large enough to be useful by Chinchilla norms.