A-TMA improves ghost-memory benchmarks: LTP + LoCoMo gains
A-TMA overlays long-term agent memories to label current, historical and transition facts, improving conflict accuracy by 0.240 on LTP.
TL;DR
- A-TMA overlays long-term agent memories to label current, historical and transition facts, improving conflict accuracy by 0.240 on LTP.
- The paper names the failure mode "ghost memory" and proposes architecture and evaluation changes to measure and reduce it.
- The authors frame ghost memory as a state coordination failure in which old, current and transition facts coexist in the memory bank, remain mixed during retrieval and mislead the answer model.
A-TMA, submitted to arXiv on 2 Jul 2026 by Zitong Shi, Yixuan Tang and Anthony Kum Hoe Tung, is a state-aware overlay for long-term agent memory that separates current, historical and transition facts so answers do not get misled by stale records. The paper names the failure mode "ghost memory" and proposes architecture and evaluation changes to measure and reduce it.
How does A-TMA work?
A-TMA decouples memory maintenance, retrieval, and answer-time resolution by keeping superseded and transition records in the bank, building evidence packets for the query's requested state view, and exposing current, historical and transition labels to the question-answering step. The overlay sits on top of existing memory systems, preserving older records rather than discarding them and explicitly marking their state so the downstream answer model can resolve conflicts instead of inheriting mixed facts.
The authors frame ghost memory as a state coordination failure in which old, current and transition facts coexist in the memory bank, remain mixed during retrieval and mislead the answer model. A-TMA's three-level view aims to make each stage—bank maintenance, retrieval, and answer resolution—observable and optimizable independently.
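As a rough illustration of the idea described above (not the authors' implementation), state labels and evidence packets can be sketched in a few lines of Python. The names `State`, `MemoryRecord`, `build_evidence_packet` and the `user.city` key are hypothetical and do not come from the paper or from Graphiti.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Hypothetical state labels mirroring the three roles named in the paper."""
    CURRENT = "current"
    HISTORICAL = "historical"
    TRANSITION = "transition"


@dataclass(frozen=True)
class MemoryRecord:
    subject: str  # hypothetical grouping key, e.g. "user.city"
    text: str     # the stored fact
    state: State  # explicit state label kept alongside the fact


def build_evidence_packet(bank, subject, view):
    """Collect labeled evidence for one subject, limited to the requested state view."""
    return [
        {"label": record.state.value, "text": record.text}
        for record in bank
        if record.subject == subject and record.state in view
    ]


# Superseded and transition records stay in the bank; they are labeled, not deleted.
bank = [
    MemoryRecord("user.city", "User lives in Boston.", State.HISTORICAL),
    MemoryRecord("user.city", "User moved from Boston to Denver.", State.TRANSITION),
    MemoryRecord("user.city", "User lives in Denver.", State.CURRENT),
]

# "Where does the user live now?" asks for the current view only.
current_view = build_evidence_packet(bank, "user.city", {State.CURRENT})

# "Where did the user live before?" asks for historical and transition evidence.
history_view = build_evidence_packet(
    bank, "user.city", {State.HISTORICAL, State.TRANSITION}
)
```

In this toy version the answer model would receive only the Denver record for a present-tense question, while the Boston record remains available, clearly labeled, for questions about the past.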
How was A-TMA evaluated and what were the results?
The paper introduces LTP (LoCoMo Temporal Plus), a conflict-heavy benchmark, and uses LoCoMo to test long-conversation generalization. It evaluates A-TMA deployed on top of an existing host memory system, Graphiti. On LTP, Graphiti+ATMA improves conflict accuracy by 0.240 absolute over Graphiti. On LoCoMo, Graphiti+ATMA raises temporal F1 from 0.0295 to 0.1705. The authors note the gains are host dependent.
Those two figures are the clearest empirical signals the paper provides: a 0.240 absolute improvement in conflict accuracy on the new LTP benchmark, and a temporal F1 increase on LoCoMo from 0.0295 to 0.1705 when Graphiti is augmented with A-TMA.
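For readers who want the LoCoMo change in both absolute and relative terms, the arithmetic on the two reported temporal F1 values is shown below. Only the two input numbers come from the paper; the variable names are illustrative.

```python
# Temporal F1 on LoCoMo as reported in the paper.
baseline_f1 = 0.0295  # Graphiti alone
overlay_f1 = 0.1705   # Graphiti+ATMA

absolute_gain = overlay_f1 - baseline_f1    # difference in F1 points
relative_factor = overlay_f1 / baseline_f1  # how many times the baseline

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative factor: {relative_factor:.2f}x")
```

That works out to about 0.141 absolute, or roughly 5.8 times the baseline score.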
Why it matters
Final QA accuracy can hide where memory systems fail; mixed state in agent memories produces confident but incorrect answers when historical and current facts are not separated. The paper argues for decoupled evaluation of bank, retrieval and answer-level failures so researchers and engineers can locate whether the issue is maintenance, retrieval mixing, or answer-time resolution. The reported gains show that explicitly encoding state roles in memory entries can reduce these hidden failures.
This matters for agents intended to act as persistent assistants where user facts change over time, because keeping outdated records without clear state labels lets old and new facts interfere with retrieval and answer generation.
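The decoupled evaluation the paper argues for can be pictured as attributing each wrong answer to the earliest stage that failed. The sketch below is a simplified assumption about how such a breakdown might be tallied; the function `locate_failure`, its boolean flags and the toy examples are hypothetical and are not the paper's metrics or data.

```python
from collections import Counter


def locate_failure(bank_has_gold, packet_has_gold, packet_is_mixed, answer_correct):
    """Attribute one QA example to the earliest stage that went wrong."""
    if not bank_has_gold:
        return "bank"       # the correct fact never made it into memory
    if not packet_has_gold or packet_is_mixed:
        return "retrieval"  # evidence is missing or mixes conflicting states
    if not answer_correct:
        return "answer"     # clean evidence, but the model still answered wrongly
    return "ok"


# Toy examples (made up for illustration only).
examples = [
    dict(bank_has_gold=True, packet_has_gold=True, packet_is_mixed=False, answer_correct=True),
    dict(bank_has_gold=False, packet_has_gold=False, packet_is_mixed=False, answer_correct=False),
    dict(bank_has_gold=True, packet_has_gold=True, packet_is_mixed=True, answer_correct=False),
    dict(bank_has_gold=True, packet_has_gold=True, packet_is_mixed=False, answer_correct=False),
]

breakdown = Counter(locate_failure(**example) for example in examples)
final_accuracy = breakdown["ok"] / len(examples)
```

A single final-accuracy number would report only that three of these four toy examples failed; the per-stage breakdown shows that each failure has a different cause.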
What to watch
Watch for replication of A-TMA's results across other host memory systems beyond Graphiti, since the authors state the gains are host dependent. Also watch adoption of the LTP benchmark and LoCoMo extensions to surface ghost-memory failures separately at bank, retrieval and answer stages.
| Metric (benchmark) | Graphiti | Graphiti+ATMA | Change |
|---|---|---|---|
| Temporal F1 (LoCoMo) | 0.0295 | 0.1705 | +0.1410 absolute |
| Conflict accuracy (LTP) | — | — | +0.240 absolute |
Written by The Brieftide · Source: arXiv