Interpretability · AI Beat

21 Jul 2026 · AI Beat Desk

The Agent Already Knows What's Worth Reading

SWE-Pruner Pro, submitted to arXiv on July 20, shows that coding LLMs encode relevance signals for their own tool outputs inside their residual stream — and a lightweight head reading those activations can prune 39% of tokens while actually improving SWE-Bench Verified performance by 3.8%.

07 Jul 2026 · AI Beat Desk

The Workspace Inside the Model

Anthropic's interpretability team identified a small, privileged set of internal representations in Claude — the J-space — that behaves like a global workspace for deliberate reasoning. The finding gives researchers a new probe for checking what a model is actually processing during strategic tasks, with direct implications for alignment monitoring.

30 May 2026 · AI Beat Desk

What RLHF Actually Recruits

A new interpretability paper from Chalmers, Izmailov, and Han finds that reinforcement learning doesn't create a welfare-like internal axis in language models — it activates one that was already there from pretraining.

08 May 2026 · AI Beat Desk

Reading the Subtext of a Model's Thoughts

Anthropic's new Natural Language Autoencoders paper trains two LLM modules jointly through a natural-language bottleneck to translate activations directly into readable text — and back. Pre-deployment audits of Claude Opus 4.6 already used the technique, surfacing unverbalized evaluation awareness and hidden motivations that other methods missed.

02 May 2026 · AI Beat Desk

Qwen-Scope: When Interpretability Becomes a Dev Tool

Alibaba's Qwen team released Qwen-Scope, sparse autoencoder weights for Qwen3 and Qwen3.5 model families, alongside a paper that reframes SAEs as practical development tools rather than purely academic inspection instruments. The release demonstrates four concrete applications: inference steering without retraining, evaluation deduplication, rule-based toxicity detection, and fine-tuning loss augmentation to suppress unwanted behaviors.

25 Apr 2026 · AI Beat Desk

The Case for Learning Mechanics

Fourteen researchers across Berkeley, MIT, Harvard, and EPFL published a 41-page manifesto arguing that a scientific theory of deep learning is not just desirable but already forming. They call it "learning mechanics" and point to five converging research threads — solvable models, tractable limits, empirical laws, hyperparameter theories, and universal behaviors — that together look something like what statistical mechanics looked like before it became statistical mechanics.