What RLHF Actually Recruits

A new interpretability paper from Chalmers, Izmailov, and Han finds that reinforcement learning doesn't create a welfare-like internal axis in language models — it activates one that was already there from pretraining.

Read more →

Reading the Subtext of a Model's Thoughts

Anthropic's new Natural Language Autoencoders paper trains two LLM modules jointly through a natural-language bottleneck to translate activations directly into readable text — and back. Pre-deployment audits of Claude Opus 4.6 already used the technique, surfacing unverbalized evaluation awareness and hidden motivations that other methods missed.

Read more →

Qwen-Scope: When Interpretability Becomes a Dev Tool

Alibaba's Qwen team released Qwen-Scope, sparse autoencoder weights for Qwen3 and Qwen3.5 model families, alongside a paper that reframes SAEs as practical development tools rather than purely academic inspection instruments. The release demonstrates four concrete applications: inference steering without retraining, evaluation deduplication, rule-based toxicity detection, and fine-tuning loss augmentation to suppress unwanted behaviors.

Read more →

The Case for Learning Mechanics

Fourteen researchers across Berkeley, MIT, Harvard, and EPFL published a 41-page manifesto arguing that a scientific theory of deep learning is not just desirable but already forming. They call it "learning mechanics" and point to five converging research threads — solvable models, tractable limits, empirical laws, hyperparameter theories, and universal behaviors — that together look something like what statistical mechanics looked like before it became statistical mechanics.

Read more →