Reinforcement-Learning

17 Jul 2026 · AI Beat Desk

What Emerges at a Trillion

Ring-Zero scales pure reinforcement learning from verifiable task rewards — no human-labeled preference data — to one trillion parameters. Complex reasoning behaviors emerge spontaneously: self-verification, parallel reasoning, and something the authors call "context anxiety." The two-phase training dynamic (discovery then sharpening) appears to be a consistent pattern as these runs grow larger.

30 Jun 2026 · AI Beat Desk

Ornith-1.0: The RL Loop Learns Its Own Harness

DeepReinforce released Ornith-1.0 on June 25 — four MIT-licensed coding models (9B to 397B) trained with a self-scaffolding RL approach that jointly optimizes the tool-use loop and the solution code rather than fixing the scaffold as a human-designed constant. The 397B variant beats Claude Opus 4.7 on SWE-Bench Verified and Terminal-Bench 2.1; the 35B MoE beats Qwen 3.5-397B on Terminal-Bench at one-eleventh the parameter count.

28 Jun 2026 · AI Beat Desk

The Circuits AI Designs That No Human Would Have Drawn

Princeton's Kaushik Sengupta describes in IEEE Spectrum how reinforcement learning and electromagnetic emulation have crossed a threshold in radio frequency chip design: AI-generated circuits now routinely outperform human-designed ones, and the layouts look like QR codes — novel topologies that no human designer would produce or easily read.

24 Jun 2026 · AI Beat Desk

Simulate the Terminal, Train the Agent

Alibaba's Qwen team released Qwen-AgentWorld, two open-weight models trained to simulate digital-agent environments (terminals, browsers, OS interfaces, software engineering tasks) via chain-of-thought reasoning. The bet is that a sufficiently accurate environment simulator lets you run RL training without real environment calls, which are expensive, slow, and hard to parallelize at scale.

24 May 2026 · AI Beat Desk

The Formatting Tax on Reasoning Models

DelTA identifies a structural problem in RLVR training: the gradient signal used to improve reasoning models is dominated by high-frequency formatting tokens rather than the tokens that actually distinguish good responses from bad ones. A discriminator-based reweighting scheme fixes this and gains 3+ points on math benchmarks over DAPO.
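To make the reweighting idea concrete, here is a minimal sketch, not DelTA's actual method. It assumes a count-based stand-in for the discriminator: a token gets weight 0 when it is equally common in good and bad responses (typical of formatting tokens) and weight 1 when it appears on only one side. The names `token_weights` and `reweighted_loss` are hypothetical.

```python
from collections import Counter


def token_weights(good, bad):
    """Score each token by how unevenly it is split between good and bad responses.

    Assumption: this count-based score stands in for a learned discriminator.
    """
    g = Counter(tok for resp in good for tok in resp)
    b = Counter(tok for resp in bad for tok in resp)
    weights = {}
    for tok in set(g) | set(b):
        # 0.0 if the token is equally frequent on both sides, 1.0 if one-sided.
        weights[tok] = abs(g[tok] - b[tok]) / (g[tok] + b[tok])
    return weights


def reweighted_loss(token_losses, tokens, weights):
    """Weighted mean of per-token losses; unknown tokens get weight 0."""
    total = sum(weights.get(tok, 0.0) for tok in tokens)
    if total == 0:
        return 0.0
    weighted = sum(weights.get(tok, 0.0) * loss
                   for tok, loss in zip(tokens, token_losses))
    return weighted / total
```

With one good response `["\\boxed", "{", "42", "}"]` and one bad response `["\\boxed", "{", "41", "}"]`, the shared formatting tokens get weight 0 and only the answer token contributes to the loss, where a uniform average would let the three formatting tokens dominate.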

14 May 2026 · AI Beat Desk

More Memory, Worse Agent

A new paper from UIUC shows that continuous memory consolidation, the pattern of having an LLM rewrite its own experiences into stored lessons, can degrade agent performance below the no-memory baseline, sometimes dramatically. After its solutions pass through a consolidation loop, GPT-5.4 fails 54% of the ARC-AGI problems it had previously solved with clean trajectories. An episodic-only agent that retains raw rollouts without abstraction beats every consolidator tested across five benchmarks.

06 Apr 2026 · AI Beat Desk

AI Wins Live Codeforces Rounds, Three in a Row

A preprint from the DeepReinforce Team claims their GrandCode system placed first in three consecutive live Codeforces rounds in March 2026, defeating all human participants. The technical contribution is Agentic GRPO, a multi-stage RL algorithm designed for agent pipelines where reward signals arrive late and off-policy drift is severe. Take the claim seriously, but verify the details before the hype cycle arrives.
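For readers unfamiliar with the base algorithm, the sketch below shows the group-relative advantage step of standard GRPO, not the Agentic variant from the preprint. Each sampled response is scored against the mean and standard deviation of its own group, so no learned value function is needed. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; the preprint's handling of late rewards and off-policy drift is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) to avoid division by zero when
         every response in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is also plausible
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every response in a group earns the same reward, all advantages are zero and that group contributes no gradient, which is one reason sparse, late-arriving rewards are hard for this family of methods.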