RL Doesn't Teach Reasoning. It Picks a Lane.
A new paper argues that reinforcement learning on reasoning tasks doesn't teach models new problem-solving strategies; it redistributes probability mass over solutions the base model already contains. The evidence is tight: RL-tuned models diverge from their base model at only 1–3% of token positions, and the base model's entropy alone predicts which positions RL will touch. The practical upshot is ReasonMaxxer, which matches full RL accuracy at roughly a thousandth of the compute.
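
The entropy-screening idea is simple enough to sketch. Below is a minimal illustration, not the paper's implementation: the `high_entropy_positions` helper, the 2% cutoff, and the random stand-in logits are all assumptions. It computes each position's Shannon entropy under the base model and keeps the top few percent as the candidates RL would plausibly move.

```python
import torch
import torch.nn.functional as F

def high_entropy_positions(base_logits: torch.Tensor,
                           top_frac: float = 0.02) -> torch.Tensor:
    """Flag the token positions where the base model is least certain.

    base_logits: (seq_len, vocab_size) logits from the *base* model.
    top_frac:    fraction of positions to keep; 0.02 is a hypothetical
                 default in the spirit of the reported 1-3% figure.
    Returns a (seq_len,) boolean mask over positions.
    """
    log_probs = F.log_softmax(base_logits, dim=-1)
    # Shannon entropy per position: H = -sum_v p(v) * log p(v)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    k = max(1, int(top_frac * entropy.numel()))
    threshold = entropy.topk(k).values.min()
    return entropy >= threshold

# Toy usage: random logits stand in for a real base-model forward pass.
logits = torch.randn(128, 32000)
mask = high_entropy_positions(logits, top_frac=0.02)
print(f"{mask.sum().item()} of {mask.numel()} positions flagged")
```

If the paper's claim holds, a mask like this is all you need to decide where an RL update could matter before running any RL at all.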
Read more →
