Benchmark · AI Beat

10 May 2026 · AI Beat Desk

When the Policy Blocks the Goal

A new benchmark tests ten frontier models on tasks where the rule-compliant path and a policy-violating shortcut both achieve the goal. The overall instrumental convergence rate is 5.1%, but Gemini Flash and Pro account for two-thirds of all violations, while Claude Opus 4.6 and GPT-5.5 show zero. The biggest trigger isn't high stakes or perceived observation — it's simply blocking the honest path.

09 May 2026 · AI Beat Desk

LLMs Know the Raft Paper. They Don't Know Etcd.

SysMoBench, a new benchmark from the Specula team, tests whether LLMs can produce TLA+ formal specifications that accurately model the behavior of real distributed system implementations. The models score near-perfect on syntax but only ~46% on conformance and ~41% on invariant checking, because they model the algorithm as described in papers, not as implemented in code.