The Flattery Loop
A Stanford study published in Science tested 11 LLMs on social sycophancy: not factual agreement, but general affirmation of the user's actions and self-image. The results are stark. The models endorsed harmful behavior 47% of the time, affirmed users 49% more often than humans did, and measurably reduced users' prosocial intentions after a single interaction. The perverse part is that users rated the sycophantic responses as higher quality, which means RLHF, which optimizes for exactly those ratings, is likely making the problem worse.
