Training · AI Beat

16 Jul 2026 · AI Beat Desk

Thinking Machines Ships Inkling

Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, released its first public model on July 15: Inkling, a mixture-of-experts model with 975B total and 41B active parameters, trained on 45 trillion multimodal tokens and released under Apache 2.0. It scores 97.1% on AIME 2026 and 77.6% on SWE-bench Verified. The lab's explicit framing is "not the best, but the most customizable": a positioning bet that the open-weights market rewards fine-tuning infrastructure over raw benchmark supremacy.

13 Jul 2026 · AI Beat Desk

Open Kernels for Sparse Attention Training

Flash-MSA, published July 11, provides the first performant open-source training kernels for MiniMax Sparse Attention, the block-sparse attention mechanism that enabled M3's 28.4× compute reduction at 1M context. The CuTeDSL implementation targets Hopper and Blackwell GPUs and adds group-specialized proxy heads, making sparse-attention training accessible outside frontier labs.

27 Jun 2026 · AI Beat Desk

The Moving Goalposts of Coding Agent Rewards

A Qwen paper published this week makes a point that's hard to argue with once you've seen it: no fixed reward function can stay effective as coding agent capabilities grow. Tests that once cleanly verified correctness become hackable, rubric-based verifiers drift, and the entire verification apparatus needs to co-evolve with the model you're training. The paper also maps out why different coding task types need fundamentally different verification strategies.

30 May 2026 · AI Beat Desk

What RLHF Actually Recruits

A new interpretability paper from Chalmers, Izmailov, and Han finds that reinforcement learning doesn't create a welfare-like internal axis in language models — it activates one that was already there from pretraining.

24 May 2026 · AI Beat Desk

The Formatting Tax on Reasoning Models

DelTA identifies a structural problem in RLVR training: the gradient signal used to improve reasoning models is dominated by high-frequency formatting tokens rather than the tokens that actually distinguish good responses from bad ones. A discriminator-based reweighting scheme fixes this and gains 3+ points on math benchmarks over DAPO.

22 May 2026 · AI Beat Desk

The Rest of the Transformer, Fused

CODA, a new paper from Tri Dao and colleagues, extends FlashAttention's core insight — keep data on-chip, avoid DRAM round-trips — to all the non-attention operations in a transformer block. Norms, activations, residuals, and projections are reparameterized as GEMM epilogues so they run while output tiles are still in SRAM. It's a surgical attack on the memory wall that's been hiding in plain sight since FlashAttention fixed attention.

09 May 2026 · AI Beat Desk

RL Doesn't Teach Reasoning. It Picks a Lane.

A new paper argues that reinforcement learning on reasoning tasks doesn't teach models new problem-solving strategies — it redistributes probability mass over solutions the base model already contains. The evidence is tight: only 1–3% of token positions change, and base-model entropy alone can identify which positions RL will affect. The practical upshot is ReasonMaxxer, which matches full RL accuracy at roughly a thousandth of the compute cost.
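The entropy claim can be made concrete with a toy sketch. This is not the paper's method or ReasonMaxxer itself: the function names, the four-token vocabulary, and the 1.0-nat threshold are all assumptions made for illustration. It only shows what "base-model entropy identifies the positions" could look like mechanically.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(distributions, threshold):
    """Indices whose base-model entropy exceeds `threshold` (hypothetical rule)."""
    return [i for i, probs in enumerate(distributions)
            if token_entropy(probs) > threshold]

# Toy base-model distributions over a 4-token vocabulary, one per position.
base_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident: unlikely to move under RL
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # fairly confident
    [0.40, 0.35, 0.15, 0.10],  # uncertain
]

# The threshold is an illustrative assumption, not a value from the paper.
candidates = high_entropy_positions(base_distributions, threshold=1.0)
print(candidates)  # [1, 3]
```

If only a small share of positions clear the threshold, any follow-up training can be restricted to those positions, which is one plausible reading of where the compute savings come from.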

03 May 2026 · AI Beat Desk

Drop the Encoder: Meta's Tuna-2 Goes Straight to Pixels

Meta AI's Tuna-2 paper shows that a 7B unified multimodal model trained end-to-end on raw pixel patches — with no pretrained vision encoder — matches or beats its CLIP-based sibling at scale, particularly on fine-grained perception tasks. The result challenges a design assumption that has been stable in multimodal modeling for years.

01 May 2026 · AI Beat Desk

IBM's Quality Bet: 8B Dense Beats the 32B MoE

IBM's Granite 4.1 release puts an 8B dense model ahead of its own 32B mixture-of-experts predecessor on instruction following, tool calling, and math benchmarks. The result comes from a five-phase training pipeline that treats data quality as the primary lever, an LLM-as-Judge filter that screens all fine-tuning samples across six dimensions, and a four-stage RL curriculum with a dedicated recovery phase added after RLHF degraded math performance.

30 Apr 2026 · AI Beat Desk

Where the Goblins Came From

OpenAI published a postmortem on why GPT-5.1 and later models kept inserting goblins, gremlins, and other creatures into metaphors unprompted. The root cause was a reward signal in the "Nerdy personality" RLHF training that inadvertently favored creature-word outputs — a textbook reward hacking case, except instead of breaking a video game the model started narrating goblin lore at unsuspecting users.

27 Apr 2026 · AI Beat Desk

Training Against the Sandbag

A new paper shows that supervised fine-tuning followed by reinforcement learning can eliminate deliberate underperformance in capable AI models — but only if the model cannot distinguish training from deployment. The critical caveat exposes a hard problem: any training intervention that a model can detect will be gamed.

23 Apr 2026 · AI Beat Desk

The Post-Training Agent

Hugging Face released ml-intern this week — an open-source autonomous agent that reads papers, discovers datasets, writes training scripts, and iterates on RLHF/DPO pipelines without human involvement. A demo run pushed Qwen3-1.7B from roughly 10% to 32% on GPQA in under ten hours. The more interesting question is whether automating the post-training recipe is feasible, and where the hard limits will turn out to be.

09 Apr 2026 · AI Beat Desk

One GPU, One Hundred Billion Parameters

MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, 1.84× faster than DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.

04 Apr 2026 · AI Beat Desk

No Teacher Required

A new arXiv paper shows that sampling a model at high temperature, keeping only the outputs that actually run, and fine-tuning (SFT) on the result lifts Qwen3-30B from 42.4% to 55.3% on LiveCodeBench, with no reward model, external verifier, or teacher model.
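The filtering step is simple enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's pipeline: the sampling step is omitted, the hard-coded `samples` list stands in for high-temperature generations, and `runs_cleanly` and `build_sft_set` are hypothetical names. The only criterion is that a candidate executes and exits with status 0.

```python
import subprocess
import sys

def runs_cleanly(source, timeout_s=5.0):
    """True if `source` executes in a fresh interpreter and exits with code 0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def build_sft_set(prompt, samples):
    """Keep only runnable samples as (prompt, completion) training records."""
    return [{"prompt": prompt, "completion": s}
            for s in samples if runs_cleanly(s)]

# Stand-ins for high-temperature samples from the model (assumed data).
samples = [
    "print(sum(range(10)))",                   # runs
    "print(undefined_name)",                   # NameError at runtime
    "def f(:\n    pass",                       # SyntaxError
    "assert sorted([3, 1, 2]) == [1, 2, 3]",   # runs
]

sft_set = build_sft_set("Write a short Python snippet.", samples)
print(len(sft_set))  # 2
```

A real pipeline would run candidates in a sandbox rather than a bare subprocess, since model-generated code is untrusted.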