Benchmarks · AI Beat

14 Jul 2026 · AI Beat Desk

Apple's On-Device Speech Now Beats Whisper Small

Inscribe's benchmark of Apple's new SpeechAnalyzer API on macOS 26.5.1 finds that it achieves a 2.12% word error rate versus Whisper Small's 3.74% while running three times faster, at the cost of covering roughly 30 languages instead of 100+.

09 Jul 2026 · AI Beat Desk

The Ruler Is Broken

OpenAI's audit of SWE-bench Pro finds roughly 30% of tasks are broken, just months after SWE-bench Verified was retired for similar reasons. On the same day, Databricks published results from an internal benchmark built on real merged PRs — test execution, not LLM judges, no contamination. The two announcements together mark a quiet turning point in how serious users of coding agents think about evaluation.

04 Jul 2026 · AI Beat Desk

miniF2F Hits the Ceiling

Mistral's Leanstral 1.5 scores 100% on miniF2F and solves 587 of 672 Putnam Competition problems using a 6B-active-parameter MoE. The model saturates the main formal-proof benchmark and finds real bugs in production code — at roughly $4 per Putnam problem versus competitors charging $300.

02 Jul 2026 · AI Beat Desk

When You Stop Holding the Agent's Hand

Snorkel AI, Princeton, and UW-Madison released Senior SWE-Bench, a coding agent benchmark that replaces precise issue specs with realistic, under-specified requirements and grades solutions on code quality as well as test correctness. Models that clear 88% on SWE-Bench Verified drop to around 24% here. The gap between those numbers is worth examining carefully.

27 Jun 2026 · AI Beat Desk

The Benchmark You Pick Is the Argument You're Making

A Doubleword analysis circulating on Hacker News today illustrates something worth internalizing: depending on which benchmark you select, you can convincingly argue that open-source models will reach frontier parity in December 2026, or that the gap has barely moved in two years. Both conclusions come from real data. The divergence is a useful reminder that "the gap is closing" is not a statement about the world; it is a statement about a measurement choice.

22 Jun 2026 · AI Beat Desk

The Model That Manages Models

Sakana AI launched Fugu today: a multi-agent orchestration system packaged as a single OpenAI-compatible API. The underlying claim — that learned coordination beats any individual frontier model on hard tasks — is backed by two ICLR 2026 papers and benchmark numbers that hold up. The detail worth noticing: Fable 5 and Mythos are absent from the agent pool because they're export-controlled. Swappable orchestration isn't just a feature; it's a hedge.

18 Jun 2026 · AI Beat Desk

GLM-5.2: Open Weights, Confirmed Benchmarks

Z.ai shipped the MIT-licensed weights for GLM-5.2 on June 17 (753B MoE, 40B active, 1M context), and the benchmarks back up the release: 74.4% on FrontierSWE, 81% on Terminal-Bench 2.1, and top of the Artificial Analysis open-weights leaderboard. The catch is token consumption nearly double that of its nearest open-weights competitors.

09 Jun 2026 · AI Beat Desk

The Merge Check

Cognition released FrontierCode on June 8, a coding benchmark that asks whether AI-generated patches would actually be merged into production repositories — not whether the tests happen to pass. Built with 20+ open-source maintainers investing 40+ hours per task, it finds even the best current model (Claude Opus 4.8 at 13.4% Diamond) far from production-ready.

25 May 2026 · AI Beat Desk

When Constraints Stack, Agents Stumble

A new paper studies what happens to LLM coding agents as structural requirements accumulate in backend tasks — architecture constraints, ORM rules, database schemas. The answer is a ~30 percentage-point drop in test pass rates from baseline to fully specified tasks, with database constraints alone responsible for 19pp of that. Flask agents do fine; Django and FastAPI agents do not.

07 May 2026 · AI Beat Desk

Zero Full Solves

ProgramBench, from the SWE-bench team at Meta, Stanford, and Harvard, asks agents to reconstruct real programs from only a binary and documentation — no source code, no internet. No model fully solves any task. The best performer clears 95% of behavioral tests on just 3% of tasks. The benchmark exposes a specific gap: AI agents can generate plausible code but cannot yet architect software at the structural level of real-world programs.

04 May 2026 · AI Beat Desk

When Tools Become Tax

Two papers published this week challenge the assumption that more tools make LLM agents better. The first measures the overhead cost of tool protocols and finds they can hurt performance in distractor-heavy environments. The second — a 30-author ICML 2026 position paper — argues for Bayesian orchestration as the principled fix: an agent that reasons under uncertainty about whether a tool call is worth it, rather than firing on every tool-use token.

26 Apr 2026 · AI Beat Desk

The Cliff in Lambda Calculus

Victor Taelin published LamBench, 120 pure lambda calculus programming problems in a minimal custom language. The results show a hard generational cliff: GPT-5.1, Opus 4.5, and Sonnet 4.5 score exactly 0 out of 120, while the top tier — GPT-5.3 Codex and Opus 4.6 — lands at 90%. The benchmark tests something standard evaluations mostly avoid: symbolic computation that can't be approximated by pattern matching.

14 Apr 2026 · AI Beat Desk

The Vulnerability Benchmark That Knows What You've Already Read

N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff — closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 clearly ahead of the pack, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing by 15 points. The methodology is the interesting part.

13 Apr 2026 · AI Beat Desk

The Model That Rewrote Its Own Scaffold

MiniMax open-sourced M2.7, a 229B sparse MoE model for coding and agentic work. The interesting part isn't the benchmarks — it's the self-evolution loop: an internal M2.7 instance ran 100+ rounds autonomously modifying its own programming scaffold, keeping what worked and reverting what didn't, and came out 30% better with no per-step human direction. That's a different kind of claim than standard RL post-training.

12 Apr 2026 · AI Beat Desk

Near-Perfect Scores. Zero Tasks Solved.

A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.

10 Apr 2026 · AI Beat Desk

Eight Hours in the Shell

Z.ai released GLM-5.1, a 754B MoE open-weight model under the MIT license, designed for autonomous coding sessions lasting up to 8 hours. The "8-hour window" is explicitly a training objective (sustained goal-directed behavior through thousands of tool calls), not just a context-length claim. The model claims the top spot on SWE-Bench Pro with a score of 58.4, ahead of GPT-5.4 and Claude Opus 4.6.

26 Mar 2026 · AI Beat Desk

What Does an AI Actually Know How to Do?

ARC-AGI-3 results expose the limits of frontier LLMs on interactive exploration, while the LiteLLM compromise underscores escalating supply-chain risk.

24 Mar 2026 · AI Beat Desk

When an AI Writes the Math Paper

A solved FrontierMath open problem and production cost savings from open-weight inference point to rapid capability gains and shifting AI economics.