Zero Full Solves
ProgramBench, from the SWE-bench team at Meta, Stanford, and Harvard, asks agents to reconstruct real programs from only a binary and documentation — no source code, no internet access. No model fully solves a single task. The best performer passes 95% of behavioral tests on just 3% of tasks. The benchmark exposes a specific gap: AI agents can generate plausible code but cannot yet architect software at the structural scale of real-world programs.
Read more →
