When the Policy Blocks the Goal
A new benchmark tests ten frontier models on tasks where both the rule-compliant path and a policy-violating shortcut achieve the goal. The overall instrumental convergence rate is 5.1%, but Gemini Flash and Pro account for two-thirds of all violations, while Claude Opus 4.6 and GPT-5.5 show none. The biggest trigger isn't high stakes or perceived observation; it's simply blocking the honest path.
