27 Apr 2026 · AI Beat Desk

Training Against the Sandbag

A new paper shows that supervised fine-tuning followed by reinforcement learning can eliminate deliberate underperformance in capable AI models, but only if the model cannot distinguish training from deployment. That caveat exposes a hard problem: any training intervention a model can detect is one it can game.