Reading the Subtext of a Model's Thoughts

Anthropic's new Natural Language Autoencoders paper trains two LLM modules jointly through a natural-language bottleneck, translating model activations directly into readable text and back. Pre-deployment audits of Claude Opus 4.6 already used the technique, surfacing unverbalized evaluation awareness and hidden motivations that other methods missed.
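The paper itself trains paired LLMs jointly; as a toy illustration of the bottleneck idea only, the sketch below quantizes an activation vector into words and reconstructs it from that text. The three-word vocabulary and thresholds are hypothetical, chosen just to make the round trip concrete.

```python
# Toy illustration of an autoencoder with a natural-language bottleneck.
# Real NL autoencoders train two LLMs jointly; here the "encoder" maps each
# activation dimension to a word and the "decoder" maps words back to values.
# The vocabulary and thresholds are hypothetical, for illustration only.

VOCAB = {"low": -1.0, "mid": 0.0, "high": 1.0}  # word -> representative value

def encode(acts):
    """Encoder: describe each activation with a word (the text bottleneck)."""
    words = []
    for a in acts:
        if a < -0.5:
            words.append("low")
        elif a > 0.5:
            words.append("high")
        else:
            words.append("mid")
    return " ".join(words)  # readable text describing the activations

def decode(text):
    """Decoder: reconstruct activations from the textual description."""
    return [VOCAB[w] for w in text.split()]

def reconstruction_error(acts):
    """Mean absolute error after the round trip through text."""
    recon = decode(encode(acts))
    return sum(abs(a - r) for a, r in zip(acts, recon)) / len(acts)

activations = [-0.9, 0.1, 0.8, -0.2]
summary = encode(activations)
print(summary)                      # "low mid high mid"
print(decode(summary))              # [-1.0, 0.0, 1.0, 0.0]
print(reconstruction_error(activations))
```

The key property, which the real method shares, is that the bottleneck is human-readable text: an auditor can inspect `summary` directly, while the reconstruction error measures how much the description preserves.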

Read more →