SWE-Bench · AI Beat

21 Jul 2026 · AI Beat Desk

The Agent Already Knows What's Worth Reading

SWE-Pruner Pro, submitted to arXiv on July 20, shows that coding LLMs encode relevance signals for their own tool outputs in the residual stream, and that a lightweight head reading those activations can prune 39% of tokens while improving SWE-Bench Verified performance by 3.8%.

09 Jun 2026 · AI Beat Desk

The Merge Check

On June 8, Cognition released FrontierCode, a coding benchmark that asks whether AI-generated patches would actually be merged into production repositories, not whether the tests happen to pass. Built with 20+ open-source maintainers investing 40+ hours per task, it finds even the best current model (Claude Opus 4.8 at 13.4% Diamond) far from production-ready.