Research / Notes

Memory for Long-Running Agents: What Actually Works

A practical guide to memory for long-running agents: what breaks, what architecture to use, and what research is converging towards.

April 29, 2026 · Updated April 29, 2026 · Abdullah Al-Hayali for Bay Labs Research

Most agent memory systems fail in familiar ways. They either stuff too much context into the prompt, rely on raw-log vector search as the entire memory layer, or compress everything into one summary that quickly goes stale. Raw-log vector search can work well for basic recall. But on its own it does not reliably update facts, resolve identity, or track relationships across people, tools, and ongoing work. All three approaches can look reasonable in a demo. They start to break once facts change over time, sessions pile up, and latency starts to matter.

This sits inside a larger shift in agent design. LLMs are now strong enough that the surrounding architecture often matters as much as the model itself. Recent work such as Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems makes that point clearly: strong agent performance depends not just on the model, but on the systems around it, with memory being a part of it.

Our perspective is: memory is not more context. It is a retrieval and update system.

That becomes especially important inside organizations, where memory has to capture relationships across people, apps, conversations, documents, and projects rather than just recall isolated text. We arrived at that perspective while building memory systems for real-time multimodal agents. If you are building a writing assistant, internal copilot, support tool, workflow agent, or research assistant, the underlying requirement is the same: the system needs to remember the right pieces of information, update them when facts change, and retrieve them only when they are relevant to the current request.

Memory Is Not Just More Context

The wrong starting point is: "how do we fit more past interactions into the prompt?"

The better starting point is:

  1. What is worth remembering?
  2. How should that memory be structured?
  3. What should be retrieved for this request?
  4. How should old memory change when new information arrives?

That distinction matters because long-running agents do not usually fail from lack of storage. They fail because they retrieve the wrong piece of information, keep stale facts alive, or overwhelm the prompt with noisy low-value history.

Three common approaches break in predictable ways:

  • Full history in context: latency, token cost, and a noisy prompt
  • Raw semantic search over logs: poor handling of names and references, stale facts, and noisy recall
  • Single continually updated summary: overcompression and loss of structure

If the system needs to know that a person, a conversation, a document, and a project are related but not identical, raw history is the wrong abstraction.

The Pattern We Recommend

The runtime flow has four parts:

  1. Identify the current context.
  2. Retrieve the right memory.
  3. Inject only the useful memory.
  4. Update memory after the response, not during it.
[Figure: Memory for long-running agents system diagram]

Runtime flow for a memory system. The memory model represents stored objects that we have seen across several companies.
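The four steps above can be sketched as a single request loop. Everything below is a minimal, hypothetical sketch: MemoryStore, the keyword matching, and the two-item cap are illustrative choices, not a specific library or a recommended scoring method.

```python
from dataclasses import dataclass, field

# Minimal illustrative store: topic -> fact, plus a queue of
# deferred write-backs. All names here are hypothetical.
@dataclass
class MemoryStore:
    records: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

def handle_request(request: str, store: MemoryStore) -> str:
    # 1. Identify the current context (cheap keyword pass, no model call).
    topics = set(request.lower().split())
    # 2. Retrieve widely: any record linked to a topic in the request.
    candidates = [(t, f) for t, f in store.records.items() if t in topics]
    # 3. Inject narrowly: only a small, high-value subset reaches the prompt.
    injected = candidates[:2]
    prompt = f"{request}\nMemory: {injected}"  # what the LLM would see
    # 4. Update later: record the interaction for asynchronous write-back.
    store.pending.append(request)
    return prompt
```

In a real system, step 2 would use embeddings or graph links and step 4 would run off the request path; the shape of the loop is the point.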

1. Identify the current context

Before retrieval, the system should identify what the request is about.

Depending on the product, that context might include:

  • the active conversation
  • the active document
  • the selected text
  • the project, account, or ticket in view
  • the main topic named in the user's request

This first step should not require heavy reasoning, and it needs to be fast. Its job is to give retrieval a useful starting point.
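A sketch of what that step can look like, assuming the request arrives as a payload that already carries app state. The field names and the capitalized-token heuristic are hypothetical, not a prescribed interface.

```python
import re

def identify_context(payload: dict) -> dict:
    # Cheap signals only: no model call, no retrieval. The goal is a
    # useful starting point for retrieval, produced almost instantly.
    text = payload.get("message", "")
    return {
        "conversation_id": payload.get("conversation_id"),
        "document_id": payload.get("document_id"),  # active artifact in view
        "selection": payload.get("selection"),       # selected text, if any
        # Crude topic signal: capitalized tokens often name projects or people.
        "named_topics": re.findall(r"\b[A-Z][a-z]+\b", text),
    }
```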

2. Retrieve the right memory

The system can consider more memory than it ultimately passes to the prompt. Scanning a wider candidate pool helps it choose what is actually worth including.
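One way to implement that wide-then-narrow pass, sketched with an invented record shape and a deliberately simple recency score:

```python
def bounded_retrieve(records, context_ids, now, pool=50, cap=5):
    # Wide pass: any record linked to the current context, up to `pool`.
    candidates = [r for r in records if set(r["links"]) & set(context_ids)]
    candidates = candidates[:pool]
    # Rank: here just recency; a real system would blend recency,
    # link strength, and semantic similarity.
    candidates.sort(key=lambda r: now - r["updated_at"])
    # Narrow pass: only the top few survive to prompt injection.
    return candidates[:cap]
```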

3. Inject only the useful memory

This is the most important rule in the whole piece.

Prompt injection should stay focused:

  • the direct object
  • a few durable facts
  • a few open loops
  • directly related context when clearly relevant

Memory becomes noisy when retrieval is broad. An overstuffed prompt hurts both the LLM's output quality and its latency.
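That discipline can be made mechanical: assemble the memory block in priority order against a hard budget. This sketch counts characters for simplicity; a real system would count tokens.

```python
def build_memory_block(direct_object, facts, open_loops, budget_chars=800):
    # Priority order mirrors the list above: the direct object first,
    # then durable facts, then open loops.
    lines, used = [], 0
    for line in [direct_object, *facts, *open_loops]:
        if used + len(line) > budget_chars:
            break  # the budget is a hard cap, not a suggestion
        lines.append(line)
        used += len(line)
    return "\n".join(lines)
```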

4. Update memory after the response

After the agent responds, the system should decide whether the new interaction:

  • adds a new fact
  • merges with an existing fact
  • updates a stale fact
  • deletes an outdated fact
  • should be ignored entirely

This is the part many memory systems skip. Storing every fact, log, and trace is not learning. Append-only memory degrades surprisingly fast: the moment a system keeps conflicting information (an old deadline alongside a new one), retrieval becomes unreliable.
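A minimal sketch of that decision, assuming facts are keyed so a new value displaces a stale one rather than accumulating beside it. In practice the classification itself would be done by a model or rules; the point here is that every branch is explicit.

```python
def write_back(store: dict, key, new_fact):
    # Decide what the new interaction does to memory.
    if new_fact is None:
        return "ignored"        # nothing durable in this turn
    old = store.get(key)
    if old == new_fact:
        return "ignored"        # already known; avoid duplicates
    store[key] = new_fact       # add a new fact, or replace a stale one
    return "updated" if old is not None else "added"
```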

The Memory Model

Memory should be stored as structured, updateable records, not as an archive of raw logs.

Store typed memory, not raw history

The most useful abstraction we found was a small typed memory graph:

  • audience: who the user is dealing with
  • context: the current artifact or workspace
  • entity: recurring named things like projects, companies, or products

Instead of storing raw history, store a small record for each memory object.

Each record keeps:

  • durable facts
  • preferences or working style
  • unresolved items
  • direct links to related objects

That gives the system a cleaner way to recognise the same person across tools, update stale facts, and pull in the right project, document, or account when needed.
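A record shape like the one described above might look as follows; the field names are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    id: str
    type: str                                        # "audience" | "context" | "entity"
    facts: list = field(default_factory=list)        # durable facts
    preferences: list = field(default_factory=list)  # working style
    open_items: list = field(default_factory=list)   # unresolved loops
    links: list = field(default_factory=list)        # ids of related records

# The same person, seen across tools, resolves to one record that
# links to the project entity rather than duplicating its facts.
dana = MemoryRecord(id="person:dana", type="audience",
                    facts=["prefers short status updates"],
                    links=["entity:atlas"])
```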

What The Research Seems To Agree On

Memory system design appears to be converging. Different teams are arriving at similar design choices from different directions.

Mem0 argues that memory should extract, consolidate, and retrieve important information instead of replaying full context. That aligns closely with explicit write-back and compact prompt-time recall.

The paper In Prospect and Retrospect: Reflective Memory Management makes two useful points: memory granularity matters, and retrieval should adapt to the current context. That supports storing compact, structured memory rather than one fixed summary or raw logs.

A-Mem pushes a related idea from another angle: memory should not be static. It should reorganize itself over time through linking and refinement.

You see the same pattern in nearby work too: hierarchical memory, typed memory, temporal updates, compact retrieval, and some form of consolidation. The implementation details vary. The direction of travel does not.

If you want to keep following that convergence across memory systems, Agent Memory Systems is a useful running index.

Where Teams Usually Go Wrong

In our experience, teams usually make one of four mistakes.

1. They treat memory as a prompt feature

It is not. Prompting matters, but memory quality depends just as much on storage, retrieval, and update logic.

2. They store too much too early

A large memory store is easy to build. A useful memory store is not. If everything is remembered, nothing is prioritized.

3. They skip identity resolution

If the system cannot reliably tell when two mentions refer to the same person, document, or project, retrieval quality falls apart quickly.

4. They make memory updates synchronous

If memory updates happen on the main request path, every response has to wait for them. That usually makes the system slower where speed matters most.
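A sketch of keeping updates off the request path with a queue and a background worker. The in-memory store and tuple payload are placeholders; the queue-and-worker shape is the point.

```python
import queue
import threading

updates = queue.Queue()   # write-backs wait here, off the request path
store = {}

def _worker():
    while True:
        item = updates.get()
        if item is None:
            break                 # shutdown signal
        key, fact = item
        store[key] = fact         # the slow consolidation happens here
        updates.task_done()

threading.Thread(target=_worker, daemon=True).start()

def respond(request: str) -> str:
    reply = f"ok: {request}"                 # answer immediately
    updates.put(("last_request", request))   # enqueue; never block on it
    return reply
```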

What This Means In Practice

If you are implementing memory today, start with this:

  1. A clear memory schema.
  2. A fast way to identify the current context.
  3. Bounded retrieval.
  4. Compact prompt injection.
  5. Explicit asynchronous write-back.

You can improve ranking, embeddings, and linking later. But those improvements will not fix a weak memory foundation.

This changes how memory should be designed from the start.

Memory is not a nice-to-have feature you add after the agent works. It is part of the core workflow design. The right memory model depends on what the system is helping with, what changes over time, what must stay stable, and what kinds of mistakes are unacceptable.

That is why we think of memory as an application architecture problem, not just a model capability.

Closing

The best agent memory systems are not the ones that remember the most. They are the ones that remember selectively, update reliably, and retrieve conservatively.

That is the pattern we trust today:

  • a fast way to identify the current context
  • typed memory
  • bounded retrieval
  • compact prompt injection
  • explicit async write-back

It is practical to deploy, flexible enough to evolve, and increasingly consistent with where research and production systems are converging.

If you'd like to understand how memory fits into your workflow, reach out.

Get in touch