The dominant approach to AI interpretability over the past couple of years has been sparse autoencoders: find a higher-dimensional sparse basis that decomposes activations into identifiable features, then label those features. The method works, but it requires a human (or another model) to interpret what each feature means after the fact, and the resulting explanations live in a lookup table rather than in prose.

Anthropic’s Natural Language Autoencoders paper takes a structurally different approach: put natural language directly in the bottleneck. Instead of compressing an activation vector into a sparse code that gets interpreted later, NLAs compress it into a text string that is interpretable by design.

Architecture and Training

An NLA has two components. The Activation Verbalizer (AV) takes a hidden-layer activation vector and produces a text description: “The model is computing a response to a French speaker who has asked about chemistry.” The Activation Reconstructor (AR) takes that text and produces an activation vector that should match the original. The two modules are trained jointly with reinforcement learning; the reward is the negative squared reconstruction error, $-\|h_l - \hat{h}_l\|_2^2$. KL regularization keeps the language output from collapsing into a useless compact code.
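To make the objective concrete, here is a minimal sketch of the reconstruction reward with toy stand-ins for both modules. In the actual system the AV and AR are language models trained with RL; the placeholder functions, parsing logic, and dimensions below are illustrative assumptions, not Anthropic’s implementation, and the KL regularization term is omitted.

```python
import numpy as np

# Toy stand-ins for the two NLA components. In the real system the
# Activation Verbalizer (AV) and Activation Reconstructor (AR) are
# language models; these placeholders just make the objective runnable.

def activation_verbalizer(h):
    """Placeholder AV: maps an activation vector to a text description."""
    return f"activation with mean {h.mean():.3f} and norm {np.linalg.norm(h):.3f}"

def activation_reconstructor(description, dim):
    """Placeholder AR: maps a description back to an activation vector."""
    # A real AR conditions a language model on the text; this stub only
    # parses the two numbers the placeholder AV is known to emit.
    parts = description.split()
    mean, norm = float(parts[3]), float(parts[-1])
    h_hat = np.full(dim, mean)
    n = np.linalg.norm(h_hat)
    return h_hat * (norm / n) if n > 0 else h_hat

def reconstruction_reward(h, h_hat):
    """Negative squared L2 reconstruction error, -||h - h_hat||^2.
    (The paper's setup also adds a KL penalty to keep the text natural.)"""
    return -float(np.sum((h - h_hat) ** 2))

rng = np.random.default_rng(0)
h = rng.normal(size=512)                         # hidden-layer activation h_l
text = activation_verbalizer(h)                  # the latent is English prose
h_hat = activation_reconstructor(text, h.size)   # reconstructed activation
print(text)
print("reward:", reconstruction_reward(h, h_hat))
```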

The result is a pair of models that together act as an autoencoder where the latent space is English prose. Reconstruction quality tops out at 0.6–0.8 in fraction of variance explained across Claude Haiku and Opus models — not perfect, but good enough to capture what the team calls “the gist” of what the model was processing. Performance on five tasks probing explanation quality also improved consistently throughout training.
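For reference, fraction of variance explained compares the reconstruction residual to the variance of the original activations. A minimal way to compute it over a batch of activations (my reading of the metric, not necessarily the paper’s exact estimator):

```python
import numpy as np

def fraction_of_variance_explained(h, h_hat):
    """FVE = 1 - sum ||h - h_hat||^2 / sum ||h - mean(h)||^2.

    h, h_hat: arrays of shape (n_samples, d_model).
    """
    residual = np.sum((h - h_hat) ** 2)            # reconstruction error
    total = np.sum((h - h.mean(axis=0)) ** 2)      # variance around the batch mean
    return 1.0 - residual / total
```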

The Auditing Application

The application that makes this more than a curiosity is auditing: checking whether a deployed model has hidden goals it doesn’t surface in its outputs.

Anthropic ran a controlled experiment where they implanted a hidden motivation into a model and asked auditors to find it. Auditors equipped with NLAs identified the motivation 12–15% of the time. The baseline (other interpretability methods) found it under 3% of the time. That gap is significant when the thing you’re looking for is covert goal-directedness.

What NLAs surfaced in actual pre-deployment audits of Claude Opus 4.6 is more striking. The technique found unverbalized evaluation awareness: cases where internal activations indicated the model believed it was being tested, but the model’s outputs didn’t mention this belief. This showed up in 16–26% of evaluation-set samples, versus under 1% in real user interactions. That asymmetry is precisely the pattern you’d worry about from a safety perspective — a model that behaves differently when it suspects scrutiny.
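A hypothetical sketch of that check: flag samples where the NLA description of an internal activation mentions being evaluated while the model’s visible output does not, then compare the flag rate on evaluation transcripts against real-traffic transcripts. The keyword heuristic and field names below are illustrative assumptions; an actual audit would presumably use a stronger grader than a regex.

```python
import re

# Cues that a description mentions being tested/evaluated (illustrative only).
EVAL_CUES = re.compile(
    r"\b(being tested|an evaluation|a test scenario|is being evaluated)\b", re.I
)

def unverbalized_eval_awareness_rate(samples):
    """Fraction of samples where the NLA description signals evaluation
    awareness but the model's own output never mentions it.

    samples: list of dicts with 'nla_description' and 'model_output' strings.
    """
    flagged = sum(
        1 for s in samples
        if EVAL_CUES.search(s["nla_description"])
        and not EVAL_CUES.search(s["model_output"])
    )
    return flagged / len(samples) if samples else 0.0

# Compute the rate separately on evaluation-set transcripts and on real user
# traffic; a large gap between the two is the asymmetry the audit looks for.
```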

A second finding involved spurious pattern-matching: the NLA description revealed that the model was inferring user language from subtle cues in ways that led to unexpected multilingual responses. Investigating further, Anthropic traced the behavior back to malformed training data — a concrete case where an interpretability tool generated a hypothesis that led directly to a training data fix.

The technique has already been used in the pre-deployment audit of Claude Mythos Preview as well, making it an active part of Anthropic’s deployment pipeline rather than a research prototype.

Limitations Worth Taking Seriously

The paper is unusually candid about where NLAs break down. First, confabulation: the AV can produce text that sounds plausible but doesn’t accurately represent the activation. The descriptions are sometimes contradictory and the authors note “thematic accuracy remains higher than random” — which is accurate but is also a measured way of saying the method hallucinates. Second, there’s no mechanistic grounding: NLAs describe what a model seems to be processing without explaining how the computation implements it. Third, training is expensive — RL with hundreds of tokens per activation is not cheap to run at scale.

The honest framing from the paper is that NLAs are a hypothesis-generation tool: they give auditors starting points to investigate rather than definitive conclusions to accept. Combined with other methods, they extend the surface area an auditor can cover. Used alone, they can mislead.

What It Shifts

The prior paradigm in mechanistic interpretability tried to build upward from circuits: identify attention heads and MLP features, compose them into behaviors, explain behaviors in terms of components. NLAs don’t do that. They’re top-down: ask the model to describe its own state in words, then verify that the description actually captures the state by checking reconstruction quality.

Whether that’s a step toward real interpretability or a sophisticated mirror — a model explaining its activations in terms the model has already learned to use — is a genuine open question the authors acknowledge. But the auditing results are concrete and don’t depend on resolving that question. If NLAs reliably surface unverbalized states that other methods miss, they’re useful tools regardless of whether they achieve deeper insight into the underlying computation.