Foundation Models · June 24, 2026 · 5 min read

MemClaw: Governed Shared Memory for Multi-Agent LLMs

MemClaw implements scoped retrieval, temporal supersession, provenance tracking and policy-governed propagation.

The Brieftide · June 24, 2026

TL;DR

01 MemClaw implements scoped retrieval, temporal supersession, provenance tracking and policy-governed propagation.
02 The paper identifies four foundational failure modes: unauthorized leakage, stale propagation, contradiction persistence, and provenance collapse.
03 The evaluation produces specific operational results: provenance reconstruction succeeded for 100% of depth-four derivation chains with correct writer identity at sub-second per-hop latency.

MemClaw is a production multi-tenant memory service that implements explicit systems-level primitives for governed shared memory in multi-agent LLM environments. It is described in a paper submitted on 23 Jun 2026 by Yanki Margalit, Nurit Cohen-Inger, Erni Avram, Ran Taig and Oded Margalit. The paper formalizes the fleet-memory problem, identifies four foundational failure modes, implements four memory primitives in MemClaw, and evaluates the live service with a reproducible harness called ArgusFleet.

What did the paper do?

MemClaw and ArgusFleet codify governance for fleet memory by defining four failure modes and four corresponding systems primitives, then measuring a live production service rather than a synthetic baseline. The paper identifies four foundational failure modes: unauthorized leakage, stale propagation, contradiction persistence, and provenance collapse. To address them the authors define scoped retrieval, temporal supersession, provenance tracking, and policy-governed memory propagation, implement those primitives in MemClaw, and evaluate them via ArgusFleet across four governance dimensions.
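To make two of these primitives concrete, here is a minimal Python sketch of scoped retrieval and temporal supersession. It is an illustration only: the `Memory` and `FleetMemory` names, the fields, and the in-memory dictionary are assumptions made for this brief, not MemClaw's actual data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Memory:
    id: str
    tenant: str
    fleet: str
    key: str                      # the fact this entry is about (hypothetical field)
    value: str
    superseded_by: Optional[str] = None


class FleetMemory:
    """Toy in-memory store; not MemClaw's actual implementation."""

    def __init__(self) -> None:
        self.items: Dict[str, Memory] = {}

    def write(self, m: Memory) -> None:
        # Temporal supersession: a new write retires any live entry for the
        # same (tenant, fleet, key). Assumes writes arrive in time order.
        for old in self.items.values():
            same_slot = (old.tenant, old.fleet, old.key) == (m.tenant, m.fleet, m.key)
            if same_slot and old.superseded_by is None:
                old.superseded_by = m.id
        self.items[m.id] = m

    def search(self, tenant: str, fleet: str, key: str) -> List[Memory]:
        # Scoped retrieval: only live entries inside the caller's tenant and fleet.
        return [
            m for m in self.items.values()
            if (m.tenant, m.fleet, m.key) == (tenant, fleet, key)
            and m.superseded_by is None
        ]


store = FleetMemory()
store.write(Memory("m1", "acme", "fleet-a", "deploy_region", "us-east"))
store.write(Memory("m2", "acme", "fleet-a", "deploy_region", "eu-west"))
store.write(Memory("m3", "acme", "fleet-b", "deploy_region", "ap-south"))
live = store.search("acme", "fleet-a", "deploy_region")
```

In this toy run, a search from fleet-a sees only the newest fleet-a entry: the older value has been superseded, and the fleet-b entry is outside the caller's scope.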

The evaluation produces specific operational results: provenance reconstruction succeeded for 100% of depth-four derivation chains with correct writer identity at sub-second per-hop latency. Propagation tests demonstrated high intra-fleet visibility with zero cross-fleet leakage. Under a strong write mode, write-to-visible latency was optimized to a single search round-trip.
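Provenance reconstruction of the kind measured here can be pictured as walking a derivation chain back from a memory to its origin and collecting the writer at each hop. The sketch below assumes each record stores a single `derived_from` parent pointer and a `writer` identity. Both field names, and the `reconstruct_chain` helper, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ProvRecord:
    memory_id: str
    writer: str                   # identity of the agent that wrote this memory
    derived_from: Optional[str]   # parent memory id, or None for an original write


def reconstruct_chain(
    records: Dict[str, ProvRecord], leaf_id: str
) -> List[Tuple[str, str]]:
    """Return (memory_id, writer) pairs from the leaf back to the origin."""
    chain: List[Tuple[str, str]] = []
    seen = set()
    current: Optional[str] = leaf_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in derivation chain at {current}")
        if current not in records:
            # A missing link is what a provenance collapse would look like here.
            raise KeyError(f"broken derivation chain at {current}")
        seen.add(current)
        rec = records[current]
        chain.append((rec.memory_id, rec.writer))
        current = rec.derived_from
    return chain


# A toy chain of four records, each derived from the previous one.
records = {
    "m1": ProvRecord("m1", "planner", None),
    "m2": ProvRecord("m2", "researcher", "m1"),
    "m3": ProvRecord("m3", "summarizer", "m2"),
    "m4": ProvRecord("m4", "reviewer", "m3"),
}
chain = reconstruct_chain(records, "m4")
```

Each loop iteration corresponds to one hop, which is the unit the per-hop latency figure refers to. A reconstruction counts as successful in this sketch only if every link resolves and carries a writer.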

How does MemClaw work and what failures did it expose?

MemClaw implements four explicit primitives—scoped retrieval, temporal supersession, provenance tracking, and policy-governed memory propagation—and the live evaluation exposed enforcement and pipeline-ordering issues that design-only analyses missed. The authors report a production architectural issue they call Asymmetric Scope Enforcement: tenant isolation held overall, but sub-tenant scope was initially bypassed on direct GET-by-id requests for agent-scoped credentials; that bypass was disclosed and remediated during the study.
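The class of bug described as Asymmetric Scope Enforcement can be sketched as two read paths that apply different checks. The Python below is a hypothetical reconstruction for illustration, not MemClaw's code: `Credential`, `Record`, and all three functions are invented names. The search path enforces both tenant and agent scope, while the first GET-by-id variant checks only the tenant.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Credential:
    tenant: str
    agent: Optional[str]   # None means a tenant-wide credential


@dataclass(frozen=True)
class Record:
    id: str
    tenant: str
    agent: str
    text: str


def in_scope(cred: Credential, rec: Record) -> bool:
    # Full check: tenant isolation plus sub-tenant (agent) scope.
    if rec.tenant != cred.tenant:
        return False
    return cred.agent is None or rec.agent == cred.agent


def search(store: Dict[str, Record], cred: Credential, term: str) -> List[Record]:
    return [r for r in store.values() if in_scope(cred, r) and term in r.text]


def get_by_id_asymmetric(
    store: Dict[str, Record], cred: Credential, rec_id: str
) -> Optional[Record]:
    # Tenant check only: the sub-tenant scope is skipped on this path.
    rec = store.get(rec_id)
    return rec if rec is not None and rec.tenant == cred.tenant else None


def get_by_id(
    store: Dict[str, Record], cred: Credential, rec_id: str
) -> Optional[Record]:
    # Same predicate as search, so both read paths agree.
    rec = store.get(rec_id)
    return rec if rec is not None and in_scope(cred, rec) else None


store = {
    "r1": Record("r1", "t1", "a1", "notes for a1"),
    "r2": Record("r2", "t1", "a2", "notes for a2"),
    "r3": Record("r3", "t2", "a1", "other tenant"),
}
cred = Credential(tenant="t1", agent="a1")
```

With an agent-scoped credential, `get_by_id_asymmetric` still refuses the other tenant's record but returns a sibling agent's record that `search` would hide. Routing every read path through one shared predicate, as `get_by_id` does, is one way to keep the checks from drifting apart.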

A second production issue was a Pipeline Ordering Conflict: although contradiction supersession handles admitted writes, a synchronous near-duplicate gate can prematurely reject contradictory writes before the asynchronous contradiction detector evaluates them. These findings show that primitives alone are not sufficient without careful pipeline ordering and enforcement checks in the live service.
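The ordering conflict is easy to reproduce in miniature, because a contradictory write often differs from the fact it contradicts by only one or two tokens. The sketch below uses token-level Jaccard similarity as a stand-in near-duplicate gate. The similarity measure, the 0.6 threshold, and the `ingest` function are all assumptions chosen for illustration; the source does not say how MemClaw's gate works.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def ingest(
    existing: List[str],
    text: str,
    async_queue: List[str],
    threshold: float = 0.6,
) -> str:
    # Step 1 (synchronous): the near-duplicate gate runs first.
    if any(jaccard(text, old) >= threshold for old in existing):
        return "rejected_near_duplicate"
    # Step 2: only admitted writes are handed to the contradiction detector,
    # modeled here as a plain list that a background worker would drain.
    existing.append(text)
    async_queue.append(text)
    return "admitted"


existing = ["the deploy region is us-east"]
queue: List[str] = []
first = ingest(existing, "the deploy region is eu-west", queue)
second = ingest(existing, "billing contact is alice", queue)
```

The first write shares four of six distinct tokens with the stored fact, so the gate rejects it and the contradiction detector never sees it. The stale value therefore stays live, even though supersession would have handled the write had it been admitted.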

Why it matters

The paper shows that long-context retrieval by itself does not solve multi-agent fleet memory problems and that explicit systems-level abstractions are required to govern shared state reliably. Live evaluation of a production service exposed concrete enforcement gaps and ordering bugs that would likely remain invisible in design-only or synthetic tests. For operators of multi-tenant LLM fleets, the results tie a concrete set of governance primitives to measurable outcomes such as 100% provenance reconstruction at depth four and zero cross-fleet leakage, making the trade-offs and failure modes actionable.

What to watch

Watch whether other teams replicate MemClaw's provenance result—100% reconstruction of depth-four derivation chains at sub-second per-hop latency—and whether pipeline-ordering fixes are adopted to avoid premature rejection by synchronous near-duplicate gates. Also monitor whether policy-governed memory propagation patterns from MemClaw appear in other production multi-tenant services or in subsequent evaluations using ArgusFleet.

Context: the paper was submitted on 23 Jun 2026 and presents these findings as part of its core contribution. The authors conclude that "Long-context retrieval alone is insufficient for production multi-agent memory," and recommend explicit systems-level abstractions plus live evaluation to surface real-world enforcement and ordering failures.

Figure: MemClaw system components and evaluation flows

Written by The Brieftide · Source: arXiv
