AutoMem: Memory skill yields 2x–4x gains on long-horizon games
AutoMem trains file-system memory actions and structures, improving a 32B open-weight agent 2x–4x across Crafter, MiniHack and NetHack.
TL;DR
- AutoMem trains file-system memory actions and structures, improving a 32B open-weight agent 2x–4x across Crafter, MiniHack and NetHack.
- AutoMem, an approach that treats memory management as a trainable skill, improves long-horizon agent performance across three procedurally generated games by roughly 2x–4x.
- The paper, by Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang and Serena Yeung-Levy, was submitted to arXiv on 1 Jul 2026 (arXiv:2607.01224) and reports results on Crafter, MiniHack and NetHack.
What is AutoMem and how does it work?
AutoMem promotes file-system operations to first-class memory actions and optimizes both the memory structure and the agent's proficiency at using it. The system casts memory expertise as metamemory: choosing what to encode, when to retrieve, and how to organize knowledge. It runs two automated loops. In the first, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure (prompts, file schemas, action vocabulary) that governs how the agent interacts with memory files. In the second, the agent's own good memory decisions are identified across many episodes and used as a training signal to sharpen its memory proficiency.
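To make "file-system operations as first-class memory actions" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `MemoryAction` and `FileMemory` names, the four operations, and the observation strings are all hypothetical, and the actual action vocabulary in AutoMem may differ.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MemoryAction:
    """Hypothetical memory action; the paper's real vocabulary may differ."""
    op: str          # "write", "append", "read", or "list"
    path: str = ""   # file name inside the memory directory
    text: str = ""   # payload for write/append


class FileMemory:
    """Toy file-backed memory rooted in a single directory (illustrative only)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def step(self, action: MemoryAction) -> str:
        """Execute one memory action and return the observation shown to the agent."""
        if action.op == "list":
            return "\n".join(sorted(p.name for p in self.root.iterdir()))
        # Keep only the file name so an action cannot escape the memory root.
        target = self.root / Path(action.path).name
        if action.op == "write":
            target.write_text(action.text)
            return f"wrote {action.path}"
        if action.op == "append":
            with target.open("a") as fh:
                fh.write(action.text)
            return f"appended to {action.path}"
        if action.op == "read":
            if target.exists():
                return target.read_text()
            return f"{action.path} not found"
        return f"unknown memory op: {action.op}"


if __name__ == "__main__":
    mem = FileMemory(Path(tempfile.mkdtemp()))
    mem.step(MemoryAction("write", "notes.md", "stairs are in the north-east room\n"))
    print(mem.step(MemoryAction("read", "notes.md")))
```

In a setup like this, the agent would emit a `MemoryAction` in the same turn-by-turn stream as its game actions, so decisions about what to write and when to read become ordinary steps that can be reviewed and trained on.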
The authors emphasize that both axes, structure and proficiency, resist manual tuning: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can stay hidden until much later, which makes human review of full trajectories impractical. AutoMem addresses that scale and the delayed credit assignment by automating both the revision of structure and the extraction of training examples from successful episodes.
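The idea of extracting training examples from successful episodes can be sketched as a simple filter. This is a rough illustration only: the brief does not say how AutoMem identifies good memory decisions, so ranking episodes by final score and keeping the top fraction is an assumption, as are the `Step`, `Episode`, and `select_memory_examples` names and the `keep_fraction` parameter.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    context: str     # what the agent saw before acting
    action: str      # the action text it emitted
    is_memory: bool  # True if the action touched memory files


@dataclass
class Episode:
    steps: List[Step]
    score: float     # final episode score (hypothetical scalar)


def select_memory_examples(
    episodes: List[Episode], keep_fraction: float = 0.25
) -> List[Dict[str, str]]:
    """Collect prompt/completion pairs from memory steps of top-scoring episodes.

    ASSUMPTION: ranking by final score stands in for whatever criterion
    the paper actually uses to identify good memory decisions.
    """
    if not episodes:
        return []
    ranked = sorted(episodes, key=lambda ep: ep.score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    examples: List[Dict[str, str]] = []
    for ep in ranked[:n_keep]:
        for step in ep.steps:
            # Task actions are skipped: only memory behavior is trained here.
            if step.is_memory:
                examples.append({"prompt": step.context, "completion": step.action})
    return examples
```

Filtering to memory steps only mirrors the brief's point that memory is optimized without changing the agent's task-action behavior. A per-episode score is a coarse signal, though, and a real pipeline would likely need something finer to handle mistakes that surface thousands of steps later.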
How much performance did AutoMem add?
Optimizing memory alone, without changing the agent's task-action behavior, improved the base agent's performance by approximately 2x–4x. The paper states that this boost made a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. The results come from experiments on three procedurally generated long-horizon games: Crafter, MiniHack and NetHack.
The reported gains are specifically attributed to treating file-system operations as first-class actions and to jointly improving the structure that supports memory and the agent's proficiency at exercising that structure. The authors note the improvement occurs when memory is optimized in isolation from other task-action learning.
Why it matters
The paper presents memory management as an independently learnable cognitive skill for LLM agents. That reframes a recurring engineering problem: rather than relying on ad hoc caches or hand-written file schemas, teams can optimize memory as a stand-alone objective that yields outsized returns on long-horizon tasks. The paper offers a concrete data point, a roughly 2x–4x performance increase purely from memory optimization, and reports that those gains make an open 32B model competitive with the named frontier systems.
This matters for applications where agents operate over lengthy episodes and where a single overlooked memory step can derail downstream performance. Automating both the structure and the learning signal reduces the human labor needed to find subtle, long-delayed bugs in memory usage and lets teams focus on higher-level policy improvements.
What to watch
Look for reproduction on more varied task families and for open-source releases of the AutoMem loops and the file-schema revision tooling. The paper links to a project website and provides code and data resources alongside the arXiv submission; those artifacts will determine how easily other researchers can validate the 2x–4x claim and try AutoMem on domains beyond Crafter, MiniHack and NetHack.
AutoMem frames memory as metamemory that can be taught, evaluated and transferred. The concrete signals to follow next are whether the approach generalizes beyond procedurally generated games and whether similar automated memory loops can reduce human effort in large-scale agent deployments.
References and source details
- Paper: AutoMem: Automated Learning of Memory as a Cognitive Skill, Shengguang Wu et al., arXiv:2607.01224, submitted 1 Jul 2026.
- Experiments reported on Crafter, MiniHack and NetHack; reported performance improvement ~2x–4x; discussion of a 32B open-weight model becoming competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking.
Written by The Brieftide · Source: arXiv