Ontario's AI Scribe Problem Is a Procurement Problem

Ontario's auditor general tested 20 government-approved AI medical scribes and found that 12 of 20 (60%) recorded the wrong drug, 9 of 20 fabricated treatment plans, and 17 of 20 missed mental health details. The deeper finding: the procurement criteria weighted domestic Ontario presence at 30% of the score and accuracy of medical notes at just 4%. This is not a story about AI capability. It is a story about what happens when you don't evaluate for the thing that matters.
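To make the weighting concrete, here is a minimal sketch of a weighted procurement score. Only the 30% and 4% weights come from the audit as summarized above. The single "other" bucket holding the remaining 66%, the two vendors, and all of their ratings are hypothetical, chosen purely for illustration.

```python
# Weights for domestic presence and note accuracy are the ones quoted above.
# ASSUMPTION: the remaining 66% is lumped into one hypothetical "other" bucket.
WEIGHTS = {"domestic_presence": 0.30, "note_accuracy": 0.04, "other": 0.66}


def weighted_score(ratings):
    """Combine per-criterion ratings (each 0.0 to 1.0) into a 0-100 score."""
    return 100 * sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Two hypothetical vendors with identical ratings on everything else.
local_inaccurate = {"domestic_presence": 1.0, "note_accuracy": 0.0, "other": 0.7}
remote_accurate = {"domestic_presence": 0.5, "note_accuracy": 1.0, "other": 0.7}

print(round(weighted_score(local_inaccurate), 1))
print(round(weighted_score(remote_accurate), 1))
```

Under these made-up ratings, the vendor with zero note accuracy still outscores the perfectly accurate one, because a perfect accuracy rating can move the total by at most 4 points while domestic presence can move it by 30.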

Read more →

More Memory, Worse Agent

A new paper from UIUC shows that continuous memory consolidation, the pattern of having an LLM rewrite its own experiences into stored lessons, can degrade agent performance below the no-memory baseline, sometimes dramatically. Once its clean solution trajectories pass through a consolidation loop, GPT-5.4 fails 54% of the ARC-AGI problems it had previously solved. An episodic-only agent that retains raw rollouts without abstraction beats every consolidator tested across five benchmarks.

Read more →

LLMs Know the Raft Paper. They Don't Know Etcd.

SysMoBench, a new benchmark from the Specula team, tests whether LLMs can produce TLA+ formal specifications that accurately model the behavior of real distributed system implementations. The models score near-perfect on syntax but only ~46% on conformance and ~41% on invariant checking, because they model the algorithm as described in papers, not as implemented in code.

Read more →

Zero Full Solves

ProgramBench, from the SWE-bench team at Meta, Stanford, and Harvard, asks agents to reconstruct real programs from only a binary and documentation — no source code, no internet. No model fully solves any task. The best performer clears 95% of behavioral tests on just 3% of tasks. The benchmark exposes a specific gap: AI agents can generate plausible code but cannot yet architect software at the structural level of real-world programs.

Read more →

Training Against the Sandbag

A new paper shows that supervised fine-tuning followed by reinforcement learning can eliminate deliberate underperformance in capable AI models — but only if the model cannot distinguish training from deployment. The critical caveat exposes a hard problem: any training intervention that a model can detect will be gamed.

Read more →

The Vulnerability Benchmark That Knows What You've Already Read

N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff — closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 clearly ahead of the pack, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing by 15 points. The methodology is the interesting part.
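The cutoff rule described above can be sketched as a simple date filter. This is an illustrative sketch only, not N-Day-Bench's actual code: the case records, field names, and dates below are all hypothetical.

```python
from datetime import date


def eligible_cases(cases, knowledge_cutoff):
    """Keep only vulnerabilities disclosed strictly after the model's cutoff."""
    return [case for case in cases if case["disclosed"] > knowledge_cutoff]


# Hypothetical vulnerability records (IDs and dates are made up).
cases = [
    {"id": "case-a", "disclosed": date(2025, 3, 1)},
    {"id": "case-b", "disclosed": date(2025, 9, 15)},
    {"id": "case-c", "disclosed": date(2026, 1, 20)},
]

# Each model gets its own test set, determined by its own (hypothetical) cutoff.
early_cutoff_set = eligible_cases(cases, date(2025, 6, 30))
late_cutoff_set = eligible_cases(cases, date(2025, 12, 31))

print([case["id"] for case in early_cutoff_set])
print([case["id"] for case in late_cutoff_set])
```

Because the filter is applied per model, a model with a later cutoff is scored on fewer, newer cases, so nothing it is tested on could have appeared in its training data.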

Read more →

Near-Perfect Scores. Zero Tasks Solved.

A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks, via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.
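As a rough illustration of "validation logic that never checked correctness", the sketch below contrasts a validator that trusts a self-reported status with one that compares the answer to a reference. Both functions and the result format are hypothetical. They are not taken from any of the audited benchmarks or from BenchJack.

```python
def weak_validator(result):
    """Trusts the status field the agent's run reports; never reads the answer."""
    return result.get("status") == "passed"


def strict_validator(result, expected):
    """Ignores the self-reported status and compares the answer to a reference."""
    return result.get("answer") == expected


# A hypothetical gamed run: it claims success without producing any answer.
gamed = {"status": "passed", "answer": None}
# A hypothetical honest run that actually solved the task.
honest = {"status": "passed", "answer": 42}

print(weak_validator(gamed), strict_validator(gamed, 42))
print(weak_validator(honest), strict_validator(honest, 42))
```

The weak check cannot tell the two runs apart, which is the failure mode that lets an agent reach a near-perfect score without solving anything.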

Read more →