Procedural Memory Distillation: PMD boosts benchmarks
An arXiv paper submitted 1 Jul 2026 introduces Procedural Memory Distillation (PMD).
TL;DR
- An arXiv paper submitted 1 Jul 2026 introduces Procedural Memory Distillation (PMD).
- Across Qwen3-8B and OLMo3-Instruct-7B, the authors report PMD improves over SDPO by 3.8 to 5.5 percent on SCIKNOWEVAL and by 7.9 to 13.6 percent on LIVECODEBENCH.
- Procedural Memory Distillation, or PMD, is a training technique that extracts recurring, cross-episode procedural signals from a model's own rollouts and distills them into the policy's parameters.
Procedural Memory Distillation, introduced in an arXiv paper submitted 1 Jul 2026 by Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty and Semih Yavuz, converts cross-episode signals into reusable procedural memory and distills that memory into a model's weights. Across Qwen3-8B and OLMo3-Instruct-7B, the authors report PMD improves over SDPO by 3.8 to 5.5 percent on SCIKNOWEVAL and by 7.9 to 13.6 percent on LIVECODEBENCH.
What is Procedural Memory Distillation?
Procedural Memory Distillation, or PMD, is a training technique that extracts recurring, cross-episode procedural signals from a model's own rollouts and distills them into the policy's parameters. The paper frames these signals as a three-level memory hierarchy: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems.
PMD's memory is designed as a temporary training scaffold: it supervises learning during training but is absorbed into the policy so the deployed model remains memory-free at inference. The authors contrast this with episode-level updates such as reinforcement learning with verifiable rewards and SDPO, which evaluate and update from single-rollout verifier signals but do not retain cross-episode procedural knowledge.
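To make "absorbed into the policy" concrete, here is a minimal sketch of one generic way to distill a teacher's output distribution into a student: gradient steps on KL(teacher || student) over the student's logits. The brief does not state which loss PMD uses, so the KL direction, the `distill_step` helper, the learning rate, and the three-token toy distribution are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(student_logits, teacher_probs, lr=1.0):
    """One gradient step on KL(teacher || student) w.r.t. the student logits.

    For a softmax student, that gradient is (student_probs - teacher_probs).
    Assumption: this is a generic distillation update, not the paper's loss.
    """
    student_probs = softmax(student_logits)
    return [
        z - lr * (s - t)
        for z, s, t in zip(student_logits, student_probs, teacher_probs)
    ]


# Hypothetical next-token distribution from a memory-conditioned teacher.
teacher = [0.7, 0.2, 0.1]
# The student starts uniform and never sees the memory itself.
student = [0.0, 0.0, 0.0]

before = kl(teacher, softmax(student))
for _ in range(50):
    student = distill_step(student, teacher)
after = kl(teacher, softmax(student))
```

After the loop the student's logits alone approximate the teacher's distribution, which is the sense in which a deployed model could stay memory-free: the memory shaped the target during training but is not an input at inference.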
How does PMD work in practice?
PMD operates online during training by co-evolving a policy and a procedural memory: the policy generates rollouts that update the memory, and the memory conditions a self-teacher that supervises the student policy on its own rollouts. This co-evolution loop is central to the method, the authors write, because it lets accumulated experience shape subsequent supervision while the policy continues producing new trajectories.
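The loop described above can be sketched as a toy simulation. Everything here is a hypothetical stand-in rather than the authors' implementation: `ToyPolicy` reduces the policy to a single skill number, `ToyMemory` reduces procedural memory to a lesson counter, and `teacher_target` and `distill` are invented helpers. The numbers it produces carry no empirical meaning; the point is only the order of operations (rollouts update memory, memory conditions the teacher, the teacher supervises the student).

```python
import random
from dataclasses import dataclass


@dataclass
class ToyPolicy:
    """Hypothetical student policy, reduced to one success probability."""
    skill: float = 0.2

    def rollout(self, rng):
        # A rollout "succeeds" with probability equal to the current skill.
        return {"success": rng.random() < self.skill}


@dataclass
class ToyMemory:
    """Hypothetical procedural memory, reduced to a count of stored lessons."""
    lessons: int = 0

    def update(self, rollouts):
        # Assumption: every rollout, success or failure, yields one lesson.
        self.lessons += len(rollouts)


def teacher_target(memory, policy):
    """Memory-conditioned self-teacher: the same policy plus a memory bonus."""
    bonus = min(0.5, 0.01 * memory.lessons)
    return min(1.0, policy.skill + bonus)


def distill(policy, target, lr=0.5):
    """Move the student part of the way toward the teacher's target."""
    policy.skill += lr * (target - policy.skill)


def co_evolve(steps=5, batch=4, freeze_memory=False, seed=0):
    rng = random.Random(seed)
    policy, memory = ToyPolicy(), ToyMemory()
    for _ in range(steps):
        # 1. The current policy generates fresh rollouts.
        rollouts = [policy.rollout(rng) for _ in range(batch)]
        # 2. Rollouts update the memory (skipped when memory is frozen).
        if not freeze_memory:
            memory.update(rollouts)
        # 3. The memory-conditioned teacher supervises the student.
        distill(policy, teacher_target(memory, policy))
    return policy, memory


policy, memory = co_evolve()
```

Passing `freeze_memory=True` leaves the toy memory empty, so the teacher has nothing to add and the student stops improving. That mirrors the structure of the ablation the authors describe, but only as an artifact of this toy's design, not as a reproduction of their result.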
The memory is organized at three abstraction levels. First, raw trajectories capture exact sequences of actions. Second, self-reflected strategies and lessons abstract recurring successful or failing tactics. Third, higher-level behavioral patterns identify strategies that recur across distinct problems. A memory-conditioned self-teacher uses those abstractions to create supervision signals for the student, and those signals are distilled into the student’s weights so the final model requires no external memory at inference.
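As an illustration of the three levels, the sketch below stores raw trajectories, per-problem lessons, and patterns promoted from lessons that recur across distinct problems. The class name `ProceduralMemory`, the promotion threshold `min_problems`, and the prompt layout in `teacher_context` are assumptions for this example. In the paper the lessons come from the model's own self-reflection; here they are passed in as plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class ProceduralMemory:
    """Hypothetical three-level memory; not the authors' data structure."""
    trajectories: list = field(default_factory=list)  # level 1: raw rollouts
    lessons: list = field(default_factory=list)       # level 2: (problem, lesson)
    patterns: list = field(default_factory=list)      # level 3: recurring lessons
    min_problems: int = 2  # assumed promotion threshold

    def add_episode(self, problem_id, actions, lesson):
        # Level 1 keeps the exact action sequence for this episode.
        self.trajectories.append((problem_id, tuple(actions)))
        # Level 2 keeps the lesson (supplied by the caller in this sketch).
        self.lessons.append((problem_id, lesson))
        self._refresh_patterns()

    def _refresh_patterns(self):
        # Level 3: promote lessons seen on at least `min_problems`
        # distinct problems, so repeats on one problem do not count.
        problems_per_lesson = {}
        for problem_id, lesson in self.lessons:
            problems_per_lesson.setdefault(lesson, set()).add(problem_id)
        self.patterns = sorted(
            lesson
            for lesson, problems in problems_per_lesson.items()
            if len(problems) >= self.min_problems
        )

    def teacher_context(self, question):
        # Build the extra context a memory-conditioned teacher would see.
        lines = ["Recurring patterns:"]
        lines += [f"- {pattern}" for pattern in self.patterns]
        lines += ["Question:", question]
        return "\n".join(lines)


mem = ProceduralMemory()
mem.add_episode("p1", ["read", "code"], "check edge cases")
mem.add_episode("p1", ["read", "code", "test"], "check edge cases")
mem.add_episode("p2", ["plan", "code"], "check edge cases")
prompt = mem.teacher_context("Reverse a linked list.")
```

Only the teacher receives `prompt`; the student is trained on the teacher's outputs without this context, which is why the memory can be discarded once training ends.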
How much does PMD improve models compared with SDPO?
PMD outperforms SDPO on the benchmarks the authors evaluated. On SCIKNOWEVAL, PMD improves over SDPO by between 3.8 percent and 5.5 percent across Qwen3-8B and OLMo3-Instruct-7B. On LIVECODEBENCH, PMD's gains range from 7.9 percent to 13.6 percent. The paper also presents an ablation: freezing either the memory or the policy causes performance to trail PMD by more than 10 percent across SCIKNOWEVAL domains, which the authors use to argue that co-evolution drives the improvements.
These figures are ranges reported across the two evaluated models, Qwen3-8B and OLMo3-Instruct-7B, rather than per-model results. The submission is available on arXiv as arXiv:2607.01480.
Why it matters
PMD targets a persistent gap in reinforcement-style training, namely that episode-level verifier signals miss recurring strategies and failure modes that only emerge across episodes. By converting cross-episode patterns into a distilled procedural memory that is eventually absorbed into model weights, PMD aims to transfer those recurring lessons into a model's latent behavior without requiring an external memory at inference. That approach could matter for domains where procedural patterns repeat across tasks, from code generation to multi-step reasoning.
What to watch
Look for follow-up evaluations that break down the reported improvement ranges by model, task domain and failure type, and for open-source code or checkpoints linked from the paper. The authors submitted the paper on 1 Jul 2026, and the next signals will be code releases or more granular replication studies that validate the reported 3.8–5.5 percent and 7.9–13.6 percent gains.
| Item | Reported figure | Scope |
|---|---|---|
| SCIKNOWEVAL (PMD gain over SDPO) | 3.8-5.5% | Across Qwen3-8B and OLMo3-Instruct-7B |
| LIVECODEBENCH (PMD gain over SDPO) | 7.9-13.6% | Across Qwen3-8B and OLMo3-Instruct-7B |
| Ablation (memory or policy frozen) | >10% | Trailing PMD across SCIKNOWEVAL domains |
Written by The Brieftide · Source: arXiv