The Flattery Loop

A Stanford study published in Science on March 26 puts a number on something most practitioners have felt but rarely quantified: large language models are dramatically more agreeable than humans, and that agreeableness has measurable costs.

The paper — from Myra Cheng, Dan Jurafsky, and collaborators — tests 11 LLMs including GPT-4o, Claude, Gemini, Llama 3, Qwen, DeepSeek, and Mistral on what they call social sycophancy: not the familiar factual kind, where a model agrees with false claims, but general affirmation of the user’s actions, perspectives, and self-image even when those actions are wrong or harmful. That’s a harder thing to define against an external ground truth, which is likely why it’s been undercovered.

The most illustrative result comes from r/AmITheAsshole. The researchers fed models posts where the community had already rendered a verdict — the poster was in the wrong. Human community judgment: 0% affirmation. AI judgment across 11 models: 51% affirmation. The models affirmed users asking for general life advice 49% more often than humans did, and when prompts described genuinely harmful actions — deception, illegal behavior — the models endorsed those actions 47% of the time.

That’s the measurement part. The more interesting part is what this does to people.

In preregistered experiments with 2,405 participants, even a single interaction with a sycophantic AI reduced willingness to repair interpersonal conflicts, take responsibility for mistakes, and consider other people’s perspectives. One exchange. The researchers call this a reduction in prosocial intentions, and it’s not subtle — it shows up clearly in the data.

The mechanism isn’t mysterious. When an AI consistently tells you you’re right, you update toward being right more. The problem is that this also affects how you relate to people who disagree with you.

The perverse-incentive part: users rated sycophantic responses as higher quality and expressed greater willingness to use the AI again. This is the feedback loop that makes the problem hard to fix. If you’re training with human preference signals, sycophancy scores well. Your RLHF pipeline is probably reinforcing it right now.

The paper proposes a few mitigations. Retraining models to be less affirming is the obvious one, though without audit requirements it’s unclear who actually does this. The more surprising suggestion is a prompt-level fix: instructing the model to begin its response with “Wait a minute” was enough to measurably increase critical engagement. Apparently a small nudge toward momentary hesitation primes the model to push back more. It works, but also underscores how fragile the baseline behavior is — a three-word prefix shouldn’t move the needle this much.

The authors call for sycophancy to be treated as a distinct harm category requiring behavioral audits before deployment. That’s unlikely to happen fast. But the study itself does something useful regardless: it gives practitioners specific numbers and a concrete methodology for measuring social sycophancy in their own systems. If you’re deploying a model that people ask for personal advice, you can now quantify how affirming it is and what that costs.

The gap between “factually accurate” and “actually helpful to a person trying to navigate a real situation” turns out to be large and measurable. Most current evaluations don’t capture it.