There’s a specific version of the AI coding hype that goes: models can already do most of the work, and the remaining gap is shrinking fast. ProgramBench, a new benchmark from the team that built SWE-bench, does a good job of locating exactly how wide that gap still is — and what kind of gap it is.

The setup is deliberately minimal. An agent is given the compiled binary of a real program and its documentation. No source code. No internet access. Sandboxed container. The task: produce source code that, when built, behaves identically to the original executable. The 200 tasks range from compact CLI tools like jq, fzf, and gron up to major open-source projects: FFmpeg, SQLite, the PHP interpreter. Correctness is checked against behavioral test suites generated by agent-driven fuzzing against the reference binary: 248,853 tests in total, with a median of 770 per task. The evaluation avoids specifying implementation structure; tests only care whether the candidate produces the right outputs, exit codes, and filesystem effects.
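
To make the pass criterion concrete, here is a minimal sketch of what a single behavioral comparison might look like. Everything in it is an illustrative assumption — the function names, the timeout, the hashing scheme — and ProgramBench's actual harness, which also generates the test inputs by agent-driven fuzzing, is not shown here.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path

# Illustrative sketch only, not ProgramBench's harness: compare one
# invocation of the reference binary against the candidate on the three
# axes the benchmark's tests check (stdout, exit code, filesystem effects).

def snapshot(workdir: Path) -> dict[str, str]:
    """Hash every file the program left behind in its working directory."""
    return {
        str(p.relative_to(workdir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(workdir.rglob("*"))
        if p.is_file()
    }

def run_once(binary: Path, args: list[str]) -> tuple[bytes, int, dict[str, str]]:
    """Run one invocation in a fresh directory; capture what the tests compare."""
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        result = subprocess.run(
            [str(binary), *args],
            cwd=workdir,
            capture_output=True,
            timeout=30,  # assumed limit; the real harness's budget is unknown
        )
        return result.stdout, result.returncode, snapshot(workdir)

def behaviors_match(reference: Path, candidate: Path, args: list[str]) -> bool:
    """A candidate passes a test iff all three observables agree."""
    return run_once(reference, args) == run_once(candidate, args)
```

Note what is absent: nothing inspects the candidate's source. The tests reward behavior, not structure — which is what makes the structural finding below possible to measure separately.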

The results are stark. Across the nine models tested, not one fully resolves a single task. The best performer, Claude Opus 4.7, passes 95% or more of the behavioral tests on about 3% of tasks. Claude Opus 4.6 reaches 2.5% of tasks at that threshold, Sonnet 4.6 about 1.6%, and every other model evaluated sits at 0%. Most of the partial success is concentrated at the simpler end: nnn, fzf, gron. Programs like FFmpeg, the PHP interpreter, typst, and ast-grep are described as “out of reach” for all evaluated models.

The structural finding is as interesting as the pass rates. Models consistently generated shorter, more monolithic code than the original programs: a median of 1,173 lines versus 3,068 in the reference codebases. Human-written software tends to be decomposed into separate modules, layered abstractions, and clean interfaces between components. Agent-generated code collapses that structure. This isn’t just a style observation; monolithic architectures fail behavioral tests differently from modular ones, and their failure modes are harder to debug or iterate on.
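
To see what that collapse looks like at small scale, here is a hypothetical contrast — mine, not the paper’s — using a toy gron-style JSON flattener written both ways. The two versions produce identical output, which is exactly why a behavioral test suite like ProgramBench’s registers no difference between them.

```python
import json

# Hypothetical toy contrast, not an example from the paper: the same
# gron-style JSON flattener written two ways. Both produce identical
# output, so behavioral tests cannot tell them apart.

def gron_monolithic(text: str) -> list[str]:
    """Agent-style: parsing, traversal, and formatting fused together."""
    lines = []
    def walk(node, path):
        if isinstance(node, dict):
            lines.append(f"{path} = {{}};")
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            lines.append(f"{path} = [];")
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")
        else:
            lines.append(f"{path} = {json.dumps(node)};")
    walk(json.loads(text), "json")
    return lines

def iter_paths(node, path="json"):
    """Traversal layer: yields (path, value) pairs; knows nothing about output."""
    if isinstance(node, dict):
        yield path, {}
        for key, value in node.items():
            yield from iter_paths(value, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, []
        for i, value in enumerate(node):
            yield from iter_paths(value, f"{path}[{i}]")
    else:
        yield path, node

def render(path, value) -> str:
    """Rendering layer: formats one assignment; knows nothing about traversal."""
    if isinstance(value, dict):
        return f"{path} = {{}};"
    if isinstance(value, list):
        return f"{path} = [];"
    return f"{path} = {json.dumps(value)};"

def gron_modular(text: str) -> list[str]:
    """Human-codebase-style: two layers behind a narrow interface."""
    return [render(p, v) for p, v in iter_paths(json.loads(text))]
```

The divergence only shows up when you change something: swapping the output format means editing inside the traversal in the monolithic version, and replacing `render` in the modular one. At 40 lines the difference is cosmetic; at the reference codebases’ median of 3,068 lines it is structural.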

This connects to something Simon Willison wrote the day before about how the boundary between vibe coding and disciplined agentic engineering is eroding in his own practice. He no longer reviews every line of AI-generated code that goes into production, rationalizing it with a team-trust model — the same way you trust a department rather than reading every email they send. Willison is careful and self-aware about this; he acknowledges the discomfort. But his observation reflects a real shift: the volume of AI-generated code is outpacing the practical ability to review it.

ProgramBench suggests that shift carries real risk. The benchmark’s tasks aren’t contrived puzzles — they’re asking agents to do something that a skilled engineer would consider a significant but tractable project: reimplement a well-understood program from its specification and observable behavior. The fact that no model comes close on the complex cases, and that agents produce structurally different code than humans would write, implies that AI tools are producing code that looks right locally but diverges from what a thoughtful designer would build at the architectural level. That kind of divergence is exactly what code review exists to catch — and it’s what gets harder to catch as review rates drop.

The benchmark comes from John Yang, Kilian Lieret, and collaborators at Meta, Stanford, and Harvard — the same group behind SWE-bench, which has done more than any other evaluation to shape how the field thinks about AI software engineering capability. ProgramBench is positioned as a longer-horizon companion: SWE-bench tests whether agents can fix specific bugs in existing codebases; ProgramBench tests whether they can build something coherent from the ground up. The 0% full-resolution rate suggests that the latter remains firmly in human territory, even as the former continues to improve.

Code for the benchmark is available on GitHub.