Foundation Models · July 2, 2026 · 5 min read

MemoryLLM: plug-n-play feed-forward memory for Transformers

An ICML paper published July 2026 shows FFNs can be trained as context-free token-wise lookups.

The Brieftide · July 2, 2026

TL;DR

01 An ICML paper published July 2026 shows FFNs can be trained as context-free token-wise lookups.
02 MemoryLLM is a paper published July 2026 at ICML that proposes decoupling transformer feed-forward networks so they act as context-free, token-wise neural retrieval memory.
03 MemoryLLM reframes feed-forward networks (FFNs) inside transformers as a form of context-free, token-wise memory.

MemoryLLM is a paper published July 2026 at ICML that proposes decoupling transformer feed-forward networks so they act as context-free, token-wise neural retrieval memory. The method trains FFNs in isolation from self-attention using token embeddings, allowing the FFNs to be pre-computed as token-wise lookups (ToLs) and moved on demand between VRAM and storage to improve inference efficiency.

What is MemoryLLM?

MemoryLLM reframes feed-forward networks (FFNs) inside transformers as a form of context-free, token-wise memory. The paper describes decoupling FFNs from self-attention so each FFN can be studied and used as token-wise neural retrieval memory, revealing how input tokens access memory locations within FFN parameters and the role of FFN memory across downstream tasks.

The authors position this change as an interpretability and systems-level move: by treating FFNs as context-free components, they become analyzable token-wise lookups rather than tightly coupled nonlinear transforms dependent on attention context. The paper lists the authors as Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Arnav Kundu, Mehrdad Farajtabar, and Minsik Cho.

How does MemoryLLM work?

MemoryLLM trains FFNs in isolation from self-attention directly using token embeddings so the FFNs behave as context-free mappings from tokens to outputs; those mappings can be pre-computed into token-wise lookups (ToLs). After training, FFNs are realized as ToLs that can be stored and accessed like a retrieval table, enabling on-demand transfer of ToLs between VRAM and storage and thereby improving inference efficiency.
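To make the pre-computation idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy two-layer FFN with random weights standing in for trained ones, a GELU activation, and a tiny vocabulary; the names `make_ffn`, `ffn`, and `precompute_tol` are hypothetical. The point it illustrates is that an FFN whose input is only the token embedding produces an output that depends only on the token id, so the output can be tabulated once per vocabulary entry.

```python
import math
import random


def make_ffn(d_model, d_hidden, seed=0):
    """Random weights for a toy two-layer FFN (a stand-in for trained weights)."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def gelu(x):
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def ffn(x, w_in, w_out):
    """Apply the FFN to one vector: w_out @ gelu(w_in @ x)."""
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def precompute_tol(embeddings, w_in, w_out):
    """Build a token-wise lookup: one FFN output row per vocabulary entry."""
    return [ffn(e, w_in, w_out) for e in embeddings]


VOCAB, D_MODEL, D_HIDDEN = 8, 4, 16  # toy sizes, chosen for illustration
rng = random.Random(1)
embeddings = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
w_in, w_out = make_ffn(D_MODEL, D_HIDDEN)
tol = precompute_tol(embeddings, w_in, w_out)

# At inference, the FFN output for a sequence is an index into the table.
token_ids = [3, 1, 3, 7]
looked_up = [tol[t] for t in token_ids]
recomputed = [ffn(embeddings[t], w_in, w_out) for t in token_ids]
```

In this sketch the looked-up rows match the recomputed FFN outputs, and repeated tokens (positions 0 and 2) share the same row regardless of what surrounds them, which is what "context-free" means here.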

Concretely, the method decouples the usual transformer pipeline in three ways. Token embeddings feed the trained FFN memory independently of attention. FFN outputs can be retrieved via lookup rather than re-computed in full during inference. And the architecture supports an operational flow in which ToLs move between fast memory (VRAM) and larger storage only as needed. The paper also examines how tokens access memory locations inside FFN parameters and measures the importance of that memory across downstream tasks.
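The on-demand movement of ToLs between fast memory and storage can be sketched as a small cache. This is an illustration under stated assumptions, not the paper's mechanism: the least-recently-used eviction policy, the `ToLCache` class, and the plain dictionary standing in for slow storage are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class ToLCache:
    """Keeps at most `capacity` lookup rows resident in 'fast' memory."""

    def __init__(self, storage, capacity):
        self.storage = storage            # stand-in for disk or host memory
        self.capacity = capacity          # max rows held in fast memory
        self.resident = OrderedDict()     # stand-in for VRAM, ordered by recency
        self.loads = 0                    # number of fetches from slow storage

    def get(self, token_id):
        if token_id in self.resident:
            # Hit: mark the row as most recently used.
            self.resident.move_to_end(token_id)
            return self.resident[token_id]
        # Miss: fetch from slow storage and make the row resident.
        row = self.storage[token_id]
        self.loads += 1
        self.resident[token_id] = row
        if len(self.resident) > self.capacity:
            # Evict the least recently used row (LRU is an assumed policy).
            self.resident.popitem(last=False)
        return row


# Toy storage: 100 tokens, each with a 2-dimensional precomputed row.
storage = {t: [float(t), float(t) * 2.0] for t in range(100)}
cache = ToLCache(storage, capacity=2)
outputs = [cache.get(t) for t in [5, 5, 9, 5, 42, 9]]
```

With a capacity of two rows, the six requests above trigger four loads from storage. How well such a cache behaves in practice would depend on how skewed token frequencies are in real workloads, which this sketch does not model.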

How does Flex-MemoryLLM differ from a full MemoryLLM or a standard transformer?

Flex-MemoryLLM sits between a conventional transformer design and the fully decoupled MemoryLLM, and it is intended to close the performance gap that arises when FFNs are trained with context-free token-wise embeddings. Rather than fully removing FFN dependence on attention, Flex-MemoryLLM blends the approaches so that models can retain some context sensitivity while still benefiting from token-wise pre-computation.

The paper frames Flex-MemoryLLM as a practical compromise: it reduces the loss in performance that pure context-free FFN training can cause, while keeping some of the efficiency and interpretability advantages of token-wise FFN memory.
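This brief does not specify how Flex-MemoryLLM blends the two approaches, so the following is only one possible reading, offered as a sketch. It assumes an additive combination of a context-dependent FFN applied to the attention output and a context-free row fetched from a precomputed lookup; the function names and the additive form are hypothetical.

```python
def flex_ffn_output(hidden_state, token_id, tol, context_ffn):
    """Combine a context-aware FFN output with a context-free lookup row.

    The additive blend is an assumption made for illustration.
    """
    context_part = context_ffn(hidden_state)  # depends on attention context
    memory_part = tol[token_id]               # depends only on the token id
    return [c + m for c, m in zip(context_part, memory_part)]


def toy_context_ffn(h):
    # Placeholder for a small context-dependent FFN (hypothetical).
    return [0.5 * v for v in h]


# Toy lookup with two tokens and 2-dimensional rows.
tol = {0: [1.0, 1.0], 1: [-1.0, 2.0]}

# Same token, different contexts: the memory part is shared, the rest differs.
out_a = flex_ffn_output([2.0, 4.0], 1, tol, toy_context_ffn)
out_b = flex_ffn_output([0.0, 0.0], 1, tol, toy_context_ffn)
```

Under this reading, only the lookup portion can be pre-computed and offloaded, while the context-dependent portion still runs at inference time, which matches the trade-off described above.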

Why it matters

MemoryLLM separates two core transformer components, attention and the FFN, and treats FFNs as an analyzable, pre-computable memory resource. That shift addresses interpretability by making it easier to study how tokens map to parameterized memory locations, and it addresses systems bottlenecks by enabling ToLs to be transferred between VRAM and storage on demand. For developers and researchers, the first effect serves scientific understanding of model internals and the second serves practical inference cost.

What to watch

Follow the ICML presentations and any released code or checkpoints for empirical comparisons and task-specific effects. The paper, published at ICML in July 2026, also includes the Flex-MemoryLLM variant, which aims to narrow the performance gap from context-free FFN training.

[Figure: MemoryLLM component layout]

Written by The Brieftide · Source: Apple Machine Learning
