The Merge Check
On June 8, Cognition released FrontierCode, a coding benchmark that asks whether AI-generated patches would actually be merged into production repositories, not merely whether the tests happen to pass. Built with 20+ open-source maintainers investing 40+ hours per task, it finds that even the best current model (Claude Opus 4.8, at 13.4% Diamond) is far from production-ready.
