Most agent memory systems fail in familiar ways. They either stuff too much context into the prompt, rely on raw-log vector search as the entire memory layer, or compress everything into one summary that quickly goes stale. Raw-log vector search can work well for basic recall. But on its own it does not reliably update facts, resolve identity, or track relationships across people, tools, and ongoing work. All three approaches can look reasonable in a demo. They start to break once facts change over time, sessions pile up, and latency starts to matter.
This sits inside a larger shift in agent design. LLMs are now strong enough that the surrounding architecture often matters as much as the model itself. Recent work such as Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems makes that point clearly: strong agent performance depends not just on the model but on the systems around it, and memory is one of those systems.
Our perspective is: memory is not more context. It is a retrieval and update system.
That becomes especially important inside organizations, where memory has to capture relationships across people, apps, conversations, documents, and projects rather than just recall isolated text. We arrived at that perspective while building memory systems for real-time multimodal agents. If you are building a writing assistant, internal copilot, support tool, workflow agent, or research assistant, the underlying requirement is the same: the system needs to remember the right pieces of information, update them when facts change, and retrieve them only when they are relevant to the current request.
Memory Is Not Just More Context
The wrong starting point is: "how do we fit more past interactions into the prompt?"
The better starting point is:
- What is worth remembering?
- How should that memory be structured?
- What should be retrieved for this request?
- How should old memory change when new information arrives?
That distinction matters because long-running agents do not usually fail from lack of storage. They fail because they retrieve the wrong piece of information, keep stale facts alive, or overwhelm the prompt with noisy low-value history.
Three common approaches break in predictable ways:
| Approach | What breaks |
|---|---|
| Full history in context | latency, token cost, and a noisy prompt |
| Raw semantic search over logs | poor handling of names and references, stale facts, noisy recall |
| Single continually updated summary | overcompression and loss of structure |
If the system needs to know that a person, a conversation, a document, and a project are related but not identical, raw history is the wrong abstraction.
The Pattern We Recommend
The runtime flow has four parts:
- Identify the current context.
- Retrieve the right memory.
- Inject only the useful memory.
- Update memory after the response, not during it.
*Runtime flow for a memory system. The memory model represents stored objects that we have seen across several companies.*
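The four-step flow can be sketched as a single request handler. Everything here (the tuple-based store, the keyword context step, the fixed budgets) is an illustrative assumption, not a prescribed implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    records: list = field(default_factory=list)   # [(topic, fact)] pairs
    pending: list = field(default_factory=list)   # interactions awaiting write-back

def identify_context(request: str) -> str:
    # Step 1: cheap, fast context identification; here, the first word as topic.
    return request.split()[0].lower()

def retrieve(store: MemoryStore, context: str, limit: int = 50) -> list:
    # Step 2: look at more memory than will finally be injected.
    return [r for r in store.records if context in r[0]][:limit]

def select(candidates: list, budget: int = 2) -> list:
    # Step 3: inject only a small, bounded subset.
    return candidates[:budget]

def handle_request(request: str, store: MemoryStore) -> str:
    selected = select(retrieve(store, identify_context(request)))
    # Step 4: defer the memory update; do not block the response on it.
    store.pending.append(request)
    facts = "; ".join(fact for _, fact in selected)
    return f"[context={identify_context(request)}] known: {facts}"

store = MemoryStore(records=[("billing", "invoices go out on the 1st"),
                             ("billing", "net-30 terms"),
                             ("hiring", "two open roles")])
print(handle_request("billing question about invoices", store))
```

The point of the skeleton is the separation: retrieval can be generous, selection must be stingy, and the update is queued rather than awaited.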
1. Identify the current context
Before retrieval, the system should identify what the request is about.
Depending on the product, the current context might be:
- the active conversation
- the active document
- the selected text
- the project, account, or ticket in view
- the main topic named in the user's request
This first step does not require reasoning, but it does need to be fast. Its job is to give retrieval a useful starting point.
2. Retrieve the right memory
The system can look at more memory than it ultimately passes to the prompt. That headroom is what lets it choose what is actually worth including.
3. Inject only the useful memory
This is the most important rule in the whole piece.
Prompt injection should stay focused:
- the direct object
- a few durable facts
- a few open loops
- directly related context when clearly relevant
Memory becomes noise when retrieval is broad and injection is unfiltered; an overstuffed prompt hurts both the LLM's output quality and latency.
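The bounded-injection idea can be sketched as a rank-and-cut under a token budget. The scoring inputs and the 4-characters-per-token estimate are rough assumptions:

```python
def inject(candidates: list[tuple[float, str]], budget_tokens: int = 100) -> str:
    # Rank candidate memories by relevance score, then cut at the budget.
    lines, used = [], 0
    for score, fact in sorted(candidates, reverse=True):
        cost = max(1, len(fact) // 4)      # crude token estimate
        if used + cost > budget_tokens:
            break
        lines.append(f"- {fact}")
        used += cost
    return "\n".join(lines)

candidates = [(0.9, "Acme renewal is due March 1"),
              (0.4, "user prefers bullet-point summaries"),
              (0.1, "mentioned the weather once in 2023")]
print(inject(candidates, budget_tokens=12))
```

The budget is the mechanism that keeps a broad retrieval stage from turning into a noisy prompt.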
4. Update memory after the response
After the agent responds, the system should decide whether the new interaction:
- adds a new fact
- merges with an existing fact
- updates a stale fact
- deletes an outdated fact
- should be ignored entirely
This is the part many memory systems skip. Storing every piece of information, log, and trace is not learning. Append-only memory degrades surprisingly fast: the moment a system keeps conflicting information alive (an old deadline next to a new one), it becomes unreliable.
The Memory Model
Memory should be stored as structured, updateable records, not as an archive of raw logs.
Store typed memory, not raw history
The most useful abstraction we found was a small typed memory graph:
- audience: who the user is dealing with
- context: the current artifact or workspace
- entity: recurring named things like projects, companies, or products
Instead of storing raw history, store a small record for each memory object.
Each record keeps:
- durable facts
- preferences or working style
- unresolved items
- direct links to related objects
That gives the system a cleaner way to recognise the same person across tools, update stale facts, and pull in the right project, document, or account when needed.
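A record of that shape can be sketched with dataclasses. The field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    kind: str                                        # "audience" | "context" | "entity"
    name: str
    facts: dict = field(default_factory=dict)        # durable facts
    preferences: list = field(default_factory=list)  # working style
    open_loops: list = field(default_factory=list)   # unresolved items
    links: set = field(default_factory=set)          # related record names

alice = MemoryRecord("audience", "alice",
                     facts={"role": "PM"},
                     open_loops=["waiting on Q3 scope"],
                     links={"project:apollo"})
apollo = MemoryRecord("entity", "project:apollo",
                      facts={"deadline": "2025-03-01"},
                      links={"alice"})

# Links are what let retrieval pull the related project when Alice comes up.
print(apollo.facts["deadline"])
```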
What The Research Seems To Agree On
We see a convergence in memory system design: different teams are arriving at similar design choices from different directions.
Mem0 argues that memory should extract, consolidate, and retrieve important information instead of replaying full context. That aligns closely with explicit write-back and compact prompt-time recall.
The paper In Prospect and Retrospect: Reflective Memory Management makes two useful points: memory granularity matters, and retrieval should adapt to the current context. That supports storing compact structured memory rather than one fixed summary or raw logs.
A-Mem pushes a related idea from another angle: memory should not be static. It should reorganize itself over time through linking and refinement.
You see the same pattern in nearby work too: hierarchical memory, typed memory, temporal updates, compact retrieval, and some form of consolidation. The implementation details vary. The direction of travel does not.
If you want to keep tracking that convergence across memory systems, Agent Memory Systems is a useful running index.
Where Teams Usually Go Wrong
In our experience, teams usually make one of four mistakes.
1. They treat memory as a prompt feature
It is not. Prompting matters, but memory quality depends just as much on storage, retrieval, and update logic.
2. They store too much too early
A large memory store is easy to build. A useful memory store is not. If everything is remembered, nothing is prioritized.
3. They skip identity resolution
If the system cannot reliably tell when two mentions refer to the same person, document, or project, retrieval quality falls apart quickly.
4. They make memory updates synchronous
If memory updates happen on the main request path, every response has to wait for them. That usually makes the system slower where speed matters most.
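One way to keep updates off the request path is a background worker fed by a queue: the handler enqueues the interaction and returns immediately. This queue-and-thread sketch is one illustrative option; a real system might use a task runner or a durable log instead:

```python
import queue
import threading

updates: "queue.Queue[tuple[str, str]]" = queue.Queue()
store: dict[str, str] = {}

def worker() -> None:
    # Background consumer: applies memory updates off the request path.
    while True:
        key, fact = updates.get()
        store[key] = fact            # the slow consolidation step lives here
        updates.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle(request: str) -> str:
    response = f"ack: {request}"
    updates.put(("last_request", request))  # enqueue the update, do not wait
    return response                         # respond without blocking

print(handle("update the Acme deadline"))
updates.join()  # for the demo only: wait so we can observe the update
print(store["last_request"])
```

The response latency now depends only on the enqueue, not on however long consolidation takes.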
What This Means In Practice
If you are implementing memory today, start with this:
- A clear memory schema.
- A fast way to identify the current context.
- Bounded retrieval.
- Compact prompt injection.
- Explicit asynchronous write-back.
You can improve ranking, embeddings, and linking later. But those improvements will not fix a weak memory foundation.
This changes how memory should be designed from the start.
Memory is not a nice-to-have feature you add after the agent works. It is part of the core workflow design. The right memory model depends on what the system is helping with, what changes over time, what must stay stable, and what kinds of mistakes are unacceptable.
That is why we think of memory as an application architecture problem, not just a model capability.
Closing
The best agent memory systems are not the ones that remember the most. They are the ones that remember selectively, update reliably, and retrieve conservatively.
That is the pattern we trust today:
- a fast way to identify the current context
- typed memory
- bounded retrieval
- compact prompt injection
- explicit async write-back
It is practical to deploy, flexible enough to evolve, and increasingly consistent with where research and production systems are converging.