Most production agentic systems now include some form of memory. The standard design is a consolidation loop: as the agent accumulates experience, a secondary LLM call rewrites those experiences into compact “lessons” stored in a persistent memory bank. Next time a similar situation arises, the agent retrieves the relevant lessons and adjusts its behavior. The idea sounds like learning: you’re distilling experience into transferable knowledge.
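To make the pattern concrete, here is a minimal sketch of that loop. The `agent_run`, `llm`, and `memory` names are illustrative placeholders, not an API from the paper or any particular framework:

```python
# A minimal sketch of the standard consolidation loop. Trajectories are
# assumed to be plain strings; `agent_run`, `llm`, and `memory` are
# placeholder interfaces.

def consolidation_loop(tasks, agent_run, llm, memory, every=10):
    """Run tasks; every `every` episodes, rewrite raw experience into lessons."""
    buffer = []
    for i, task in enumerate(tasks, start=1):
        lessons = memory.retrieve(task)          # pull relevant stored lessons
        buffer.append(agent_run(task, lessons))  # act; record the raw trajectory
        if i % every == 0:
            # The secondary LLM call: compress buffered trajectories
            # into compact "lessons" and store those instead.
            memory.store(llm("Distill reusable lessons from:\n" + "\n".join(buffer)))
            buffer.clear()
```

Everything that follows is about what that distillation call does to the memory bank as the iterations accumulate.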

A preprint from the University of Illinois Urbana-Champaign shows this paradigm has a significant failure mode that’s easy to miss because it only appears after enough iterations.

The Setup

The researchers tested memory consolidation across five benchmarks — ALFWorld, ScienceWorld, WebShop, AppWorld, and Mind2Web — plus custom ARC-AGI problem streams. The consolidation loop uses the same LLM that solves tasks to also rewrite its own memories of those solutions. They varied consolidation schedules while holding the pool of underlying trajectories constant, which lets them attribute performance differences to the consolidation process rather than to data quality.

The core ARC-AGI experiment is striking. The researchers start from 19 problems that GPT-5.4 solves correctly with an empty memory bank, then stream the ground-truth solution trajectories through the consolidation loop. After consolidation, the agent fails on 54% of those same problems. The trajectories are clean; they contain correct solutions. The failure is entirely in the rewrite step.

Three Ways Consolidation Breaks Things

The paper identifies three distinct failure modes.

Misgrouping happens when the consolidator pools episodes from different task families. The LLM’s abstraction process finds surface-level similarities across unrelated problems and generates a combined lesson that doesn’t apply cleanly to any of them. The memory now contains advice that’s technically coherent but contextually wrong.

Interference is the over-generalization problem. When the consolidator abstracts a lesson, it strips the applicability conditions: the caveats about when the strategy works. “Decompose complex tasks into subgoals” is true often enough, but stored without the conditions under which it fails, it becomes noise. The more the agent consolidates, the more context-free its stored lessons become.

Overfitting is the narrowing problem. When the input distribution is limited — say, a few hundred tasks from a single benchmark — the consolidator produces memories that describe the surface regularities of the specific examples rather than the underlying strategy. The resulting memories fail on close variants of the same problem family.

On ScienceWorld, the numbers are concrete: an agent with fresh-only consolidation (rewriting only the current task’s experience) outperforms cumulative consolidation (rewriting across all prior tasks) by 203 points. The cumulative approach accumulates over-generalized memories at five times the rate of the fresh approach, and outright garbage at twenty times the rate.
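The difference between the two schedules is easy to state in code. A hedged sketch with the same placeholder interfaces as above, where `memory` is just a list of lesson strings: fresh-only distills each episode in isolation, while cumulative re-rewrites the entire bank on every pass, so each abstraction gets re-abstracted.

```python
# Illustrative contrast between the two consolidation schedules.
# `llm` is a placeholder; `memory` is a plain list of lesson strings.

def fresh_only_step(episode, llm, memory):
    # Fresh-only: the lesson is distilled from this episode alone;
    # previously stored lessons are never rewritten.
    memory.append(llm("Distill a lesson from this episode:\n" + episode))

def cumulative_step(episode, llm, memory):
    # Cumulative: the whole bank plus the new episode is rewritten in one
    # pass, so every prior abstraction is re-abstracted each time. This is
    # where over-generalization compounds across iterations.
    rewritten = llm("Consolidate everything into lessons:\n"
                    + "\n".join(memory + [episode]))
    memory[:] = [rewritten]
```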

The Episodic Alternative

The proposed fix is simple and counterintuitive: don’t consolidate. Store the raw rollouts. Retrieve them at inference time. Don’t abstract, don’t rewrite. The paper calls this an “episodic-only” agent, and it matches or beats every consolidator tested across the benchmark suite.

This is a different design philosophy. Instead of treating memory as a database of compressed lessons, it treats memory as a cache of examples with retrieval. The agent doesn’t learn from its past in the sense of updating a generalized model of the world — it just looks up what it did in similar situations before and uses that as context.
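Here is a self-contained sketch of that design, with naive token-overlap scoring standing in for whatever embedding-based retriever a production system would actually use:

```python
# Episodic-only memory: store raw rollouts verbatim, retrieve the most
# similar ones at inference time. Token-overlap scoring is a stand-in for
# a real retriever; the point is that nothing is ever rewritten.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []  # (task_description, raw_trajectory) pairs

    def store(self, task, trajectory):
        # No rewriting, no abstraction: the transcript is kept as-is.
        self.episodes.append((task, trajectory))

    def retrieve(self, query, k=3):
        q = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda e: len(q & set(e[0].lower().split())),
            reverse=True,
        )
        # Return raw transcripts to drop into the agent's context unchanged.
        return [trajectory for _, trajectory in scored[:k]]
```

Note what’s absent: there is no LLM call anywhere in the write path, so nothing can rewrite, and therefore nothing can corrupt, the stored experience.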

There’s a parallel to the broader debate in machine learning between parametric and non-parametric approaches. Parametric approaches compress knowledge into model weights; non-parametric approaches (nearest-neighbor, retrieval-augmented generation) keep the data around and query it at runtime. The parametric path is more elegant and more interpretable, but it’s lossy and sensitive to the quality of the compression. The consolidation problem is the text-memory version of this: the LLM that rewrites experience into lessons is doing lossy compression, and the losses accumulate.

What This Means for Agentic Systems

The immediate implication is practical. If you’re building an agent that maintains a growing memory bank through iterative consolidation — which describes a large fraction of production agent systems — you should test whether performance degrades over time. The degradation isn’t necessarily visible at low iteration counts; it emerges as the memory bank grows and the consolidation-induced errors compound.
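One way to operationalize that test, sketched under assumptions (the `evaluate` and `run_episode` hooks are hypothetical): freeze a probe set of tasks the agent already solves, then re-score it periodically as the memory bank grows, which mirrors the paper’s ARC-AGI setup.

```python
# Regression probe for consolidation-induced degradation. `evaluate` returns
# the solve rate on a fixed probe set; `run_episode` is one step of normal
# operation (act, then store/consolidate memory). Both are assumed hooks.

def memory_degradation_probe(probe_tasks, task_stream, evaluate, run_episode,
                             checkpoint_every=50):
    baseline = evaluate(probe_tasks)  # solve rate before any new experience
    history = [(0, baseline)]
    for i, task in enumerate(task_stream, start=1):
        run_episode(task)
        if i % checkpoint_every == 0:
            score = evaluate(probe_tasks)
            history.append((i, score))
            if score < baseline:
                # Regression on previously solved tasks is the signature of
                # memory corruption, not of harder incoming tasks.
                print(f"probe solve rate fell to {score:.2f} "
                      f"after {i} episodes (baseline {baseline:.2f})")
    return history
```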

The deeper implication is about what we mean by “learning” in agentic contexts. The intuitive mental model is that a consolidating agent is getting smarter over time, distilling experience the way a skilled practitioner does. The paper suggests this analogy is misleading. Human expert knowledge is heavily situated — experts know not just a strategy but when to apply it and when not to. LLM-based consolidation frequently strips the situational context, producing advice that’s superficially reasonable but missing the metacognitive framing that makes it useful.

The episodic-only approach is less elegant from a design perspective (there’s something unsatisfying about an agent whose “memory” is just a pile of old transcripts), but it’s more robust. The failure modes of retrieval are different (stale or irrelevant examples can surface) and arguably more tractable than the progressive degradation of abstraction-based consolidation.

The project page includes specific figures on failure mode frequency and the benchmark curves across consolidation iterations.