The Merge Check · AI Beat

SWE-Bench asks a specific, answerable question: does the test suite pass after the patch? That’s a useful signal. It’s also a narrow one. A patch can make the tests green while introducing a brittle implementation, modifying files it shouldn’t touch, writing tests that only pass on the fixed code (but wouldn’t catch regressions), or producing something a maintainer would politely decline at review.

FrontierCode, published June 8 by Cognition, asks the broader question: would a human maintainer actually merge this?

The benchmark was built with 20+ open-source maintainers who spent 40+ hours each constructing tasks. Each task includes not just expected behavior but a rubric — authored by the actual project maintainer — covering six dimensions: behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. That last category is deliberately subjective; it’s what makes a patch look like it was written by someone who understands the codebase rather than someone trying to pass a test.

Two of the evaluation methods are worth noting. Reverse-classical testing checks whether the tests an agent writes will actually fail on broken code — a subtle but important property that standard test correctness checks miss. You can write tests that always pass by asserting the wrong thing. Code scope checks verify that a patch modifies only files and lines that are necessary. Scope creep is a real signal: an agent that touches unrelated files to make a test pass is doing something wrong even if the output is functionally equivalent.

The results are sobering in the expected direction. Claude Opus 4.8 leads the Diamond tier (50 hardest tasks) at 13.4%. GPT-5.5 scores 6.3%. Gemini 3.1 Pro reaches 4.7%. On the Extended tier (150 tasks, easier), Claude Opus 4.8 reaches 51.8%, which sounds better until you remember that means it fails roughly half the tasks on the accessible portion of the benchmark. The false positive rate is 81% lower than SWE-Bench Pro, which is the point: it’s harder to fake your way through.

The benchmark also carries an implicit indictment of SWE-Bench leaderboard numbers. Prior work from METR found that high-scoring models frequently produce patches that wouldn’t pass a maintainer’s code review — functionally correct but operationally wrong in ways that matter to a real project. FrontierCode builds that intuition into the evaluation rubric from the start.

There’s a limitation worth acknowledging: the tasks themselves are currently withheld from public release to prevent contamination. Evaluations are available to model creators, which makes FrontierCode less useful as an external auditing tool and more like an independent lab report for now. Whether that changes depends on how quickly models start gaming whatever small set leaks out.

The deeper question FrontierCode raises is architectural. Current coding agents are trained to maximize task completion on benchmarks where “completion” means passing tests. If the training signal doesn’t include scope discipline, regression safety, or code quality, you shouldn’t expect those properties to emerge — and apparently they largely haven’t. Moving the goalposts to “mergeable” requires moving the training signal too.