Most agent memory systems fail in familiar ways. They either stuff too much context into the prompt, rely on raw-log vector search as the entire memory layer, or compress everything into one summary that quickly goes stale. Raw-log vector search can work well for basic recall. But on its own it does not reliably update facts, resolve identity, or track relationships across people, tools, and ongoing work. All three approaches can look reasonable in a demo. They start to break once facts change over time, sessions pile up, and latency starts to matter.
This sits inside a larger shift in agent design. LLMs are now strong enough that the surrounding architecture often matters as much as the model itself. Recent work such as Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems makes that point clearly: strong agent performance depends not just on the model but on the systems around it, and memory is one of those systems.
Our perspective is: memory is not more context. It is a retrieval and update system.
That becomes especially important inside organizations, where memory has to capture relationships across people, apps, conversations, documents, and projects rather than just recall isolated text. We arrived at that perspective while building memory systems for real-time multimodal agents. If you are building a writing assistant, internal copilot, support tool, workflow agent, or research assistant, the underlying requirement is the same: the system needs to remember the right pieces of information, update them when facts change, and retrieve them only when they are relevant to the current request.
Memory Is Not Just More Context
The wrong starting point is: "how do we fit more past interactions into the prompt?"
The better starting point is:
- What is worth remembering?
- How should that memory be structured?
- What should be retrieved for this request?
- How should old memory change when new information arrives?
That distinction matters because long-running agents do not usually fail from lack of storage. They fail because they retrieve the wrong piece of information, keep stale facts alive, or overwhelm the prompt with noisy low-value history.
Three common approaches break in predictable ways:
| Approach | What breaks |
|---|---|
| Full history in context | latency, token cost, and a noisy prompt |
| Raw semantic search over logs | poor handling of names and references, stale facts, noisy recall |
| Single continually updated summary | overcompression and loss of structure |
If the system needs to know that a person, a conversation, a document, and a project are related but not identical, raw history is the wrong abstraction.
The Pattern We Recommend
The runtime flow has four parts:
- Identify the current context.
- Retrieve the right memory.
- Inject only the useful memory.
- Update memory after the response, not during it.
*Runtime flow for a memory system. The memory model represents stored objects that we have seen across several companies.*
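The four-step flow can be sketched as a single request handler. Everything here (the tuple-based store, the keyword context step, the fixed budgets) is an illustrative assumption, not a prescribed implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    records: list = field(default_factory=list)   # [(topic, fact)] pairs
    pending: list = field(default_factory=list)   # interactions awaiting write-back

def identify_context(request: str) -> str:
    # Step 1: cheap, fast context identification; here, the first word as topic.
    return request.split()[0].lower()

def retrieve(store: MemoryStore, context: str, limit: int = 50) -> list:
    # Step 2: look at more memory than will finally be injected.
    return [r for r in store.records if context in r[0]][:limit]

def select(candidates: list, budget: int = 2) -> list:
    # Step 3: inject only a small, bounded subset.
    return candidates[:budget]

def handle_request(request: str, store: MemoryStore) -> str:
    selected = select(retrieve(store, identify_context(request)))
    # Step 4: defer the memory update; do not block the response on it.
    store.pending.append(request)
    facts = "; ".join(fact for _, fact in selected)
    return f"[context={identify_context(request)}] known: {facts}"

store = MemoryStore(records=[("billing", "invoices go out on the 1st"),
                             ("billing", "net-30 terms"),
                             ("hiring", "two open roles")])
print(handle_request("billing question about invoices", store))
```

The point of the skeleton is the separation: retrieval can be generous, selection must be stingy, and the update is queued rather than awaited.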
1. Identify the current context
Before retrieval, the system should identify what the request is about.
Depending on the product, the current context might be:
- the active conversation
- the active document
- the selected text
- the project, account, or ticket in view
- the main topic named in the user's request
This first step does not require reasoning, but it does need to be fast. Its job is to give retrieval a useful starting point.
2. Retrieve the right memory
The system can look at more memory than it ultimately passes to the prompt. That headroom is what lets it choose what is actually worth including.
3. Inject only the useful memory
This is the most important rule in the whole piece.
Prompt injection should stay focused:
- the direct object
- a few durable facts
- a few open loops
- directly related context when clearly relevant
Memory becomes noise when retrieval is broad and injection is unfiltered; an overstuffed prompt hurts both the LLM's output quality and latency.
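The bounded-injection idea can be sketched as a rank-and-cut under a token budget. The scoring inputs and the 4-characters-per-token estimate are rough assumptions:

```python
def inject(candidates: list[tuple[float, str]], budget_tokens: int = 100) -> str:
    # Rank candidate memories by relevance score, then cut at the budget.
    lines, used = [], 0
    for score, fact in sorted(candidates, reverse=True):
        cost = max(1, len(fact) // 4)      # crude token estimate
        if used + cost > budget_tokens:
            break
        lines.append(f"- {fact}")
        used += cost
    return "\n".join(lines)

candidates = [(0.9, "Acme renewal is due March 1"),
              (0.4, "user prefers bullet-point summaries"),
              (0.1, "mentioned the weather once in 2023")]
print(inject(candidates, budget_tokens=12))
```

The budget is the mechanism that keeps a broad retrieval stage from turning into a noisy prompt.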
4. Update memory after the response
After the agent responds, the system should decide whether the new interaction:
- adds a new fact
- merges with an existing fact
- updates a stale fact
- deletes an outdated fact
- should be ignored entirely
This is the part many memory systems skip. Storing every piece of information, log, and trace is not learning. Append-only memory degrades surprisingly fast: the moment a system keeps conflicting information alive (an old deadline next to a new one), it becomes unreliable.
The Memory Model
Memory should be stored as structured, updateable records, not as an archive of raw logs.
Store typed memory, not raw history
The most useful abstraction we found was a small typed memory graph:
- audience: who the user is dealing with
- context: the current artifact or workspace
- entity: recurring named things like projects, companies, or products
Instead of storing raw history, store a small record for each memory object.
Each record keeps:
- durable facts
- preferences or working style
- unresolved items
- direct links to related objects
That gives the system a cleaner way to recognise the same person across tools, update stale facts, and pull in the right project, document, or account when needed.
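A record of that shape can be sketched with dataclasses. The field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    kind: str                                        # "audience" | "context" | "entity"
    name: str
    facts: dict = field(default_factory=dict)        # durable facts
    preferences: list = field(default_factory=list)  # working style
    open_loops: list = field(default_factory=list)   # unresolved items
    links: set = field(default_factory=set)          # related record names

alice = MemoryRecord("audience", "alice",
                     facts={"role": "PM"},
                     open_loops=["waiting on Q3 scope"],
                     links={"project:apollo"})
apollo = MemoryRecord("entity", "project:apollo",
                      facts={"deadline": "2025-03-01"},
                      links={"alice"})

# Links are what let retrieval pull the related project when Alice comes up.
print(apollo.facts["deadline"])
```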
What The Research Seems To Agree On
We see a convergence in memory system design: different teams are arriving at similar design choices from different directions.
Mem0 argues that memory should extract, consolidate, and retrieve important information instead of replaying full context. That aligns closely with explicit write-back and compact prompt-time recall.
The paper In Prospect and Retrospect: Reflective Memory Management makes two useful points: memory granularity matters, and retrieval should adapt to the current context. That supports storing compact structured memory rather than one fixed summary or raw logs.
A-Mem pushes a related idea from another angle: memory should not be static. It should reorganize itself over time through linking and refinement.
You see the same pattern in nearby work too: hierarchical memory, typed memory, temporal updates, compact retrieval, and some form of consolidation. The implementation details vary. The direction of travel does not.
If you want to keep tracking that convergence across memory systems, Agent Memory Systems is a useful running index.
Where Teams Usually Go Wrong
In our experience, teams usually make one of four mistakes.
1. They treat memory as a prompt feature
It is not. Prompting matters, but memory quality depends just as much on storage, retrieval, and update logic.
2. They store too much too early
A large memory store is easy to build. A useful memory store is not. If everything is remembered, nothing is prioritized.
3. They skip identity resolution
If the system cannot reliably tell when two mentions refer to the same person, document, or project, retrieval quality falls apart quickly.
4. They make memory updates synchronous
If memory updates happen on the main request path, every response has to wait for them. That usually makes the system slower where speed matters most.
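One way to keep updates off the request path is a background worker fed by a queue: the handler enqueues the interaction and returns immediately. This queue-and-thread sketch is one illustrative option; a real system might use a task runner or a durable log instead:

```python
import queue
import threading

updates: "queue.Queue[tuple[str, str]]" = queue.Queue()
store: dict[str, str] = {}

def worker() -> None:
    # Background consumer: applies memory updates off the request path.
    while True:
        key, fact = updates.get()
        store[key] = fact            # the slow consolidation step lives here
        updates.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle(request: str) -> str:
    response = f"ack: {request}"
    updates.put(("last_request", request))  # enqueue the update, do not wait
    return response                         # respond without blocking

print(handle("update the Acme deadline"))
updates.join()  # for the demo only: wait so we can observe the update
print(store["last_request"])
```

The response latency now depends only on the enqueue, not on however long consolidation takes.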
What This Means In Practice
If you are implementing memory today, start with this:
- A clear memory schema.
- A fast way to identify the current context.
- Bounded retrieval.
- Compact prompt injection.
- Explicit asynchronous write-back.
You can improve ranking, embeddings, and linking later. But those improvements will not fix a weak memory foundation.
This changes how memory should be designed from the start.
Memory is not a nice-to-have feature you add after the agent works. It is part of the core workflow design. The right memory model depends on what the system is helping with, what changes over time, what must stay stable, and what kinds of mistakes are unacceptable.
That is why we think of memory as an application architecture problem, not just a model capability.
Closing
The best agent memory systems are not the ones that remember the most. They are the ones that remember selectively, update reliably, and retrieve conservatively.
That is the pattern we trust today:
- a fast way to identify the current context
- typed memory
- bounded retrieval
- compact prompt injection
- explicit async write-back
It is practical to deploy, flexible enough to evolve, and increasingly consistent with where research and production systems are converging.