WM-SAR: Stable World-Model Correction for Agent Rollouts
WM-SAR repairs the causal subgraph that re-amplifies errors, outperforming scan-and-repair LLM correctors under realistic token budgets.
TL;DR
- WM-SAR repairs the causal subgraph that re-amplifies errors, outperforming scan-and-repair LLM correctors under realistic token budgets.
- Xinyuan Song and Zekun Cai submitted an arXiv paper on 2 July 2026 (arXiv:2607.01767, v1) that proposes WM-SAR, a world-model corrector designed to repair planning graphs for long agent rollouts.
- WM-SAR works backward from subgraph amplification: it identifies the nodes and edges that keep re-amplifying error and sends only that causal subgraph to the LLM.
Xinyuan Song and Zekun Cai submitted an arXiv paper on 2 July 2026 (arXiv:2607.01767, v1) that proposes WM-SAR, a world-model corrector designed to repair planning graphs for long agent rollouts. The authors say replanning entire graphs is impractical for workflows spanning thousands or tens of thousands of steps, and offer WM-SAR as a targeted repair that sends only a causal subgraph to an LLM.
What is WM-SAR and how does it work?
WM-SAR works backward from subgraph amplification: it identifies the nodes and edges that keep re-amplifying error and sends only that causal subgraph to the LLM. In other words, rather than scanning for visible symptoms across nodes and edges, WM-SAR isolates the subgraph responsible for recurring amplification of mistakes and uses that compact region as the repair target for the language model.
The paper positions WM-SAR as a distinct family of corrector. The authors contrast it with the common engineering approach that scans nodes and edges, selects a suspicious local region, and asks an LLM to repair it. WM-SAR reverses that workflow by diagnosing amplification paths first and then repairing the causal structure in place.
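The contrast between the two workflows can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: errors are modelled as flowing linearly along edges with non-negative gains, an "amplifying" region is taken to be a strongly connected component whose internal growth rate exceeds 1, and the names `GAINS`, `growth_rate`, `amplifying_subgraph`, and `symptom_scan` are hypothetical. It is not the authors' algorithm, only one way to see why the nodes with the largest visible error need not be the ones that cause it.

```python
import math
from typing import Dict, List, Set, Tuple

# (src, dst) -> non-negative error gain along that edge (toy assumption).
Gains = Dict[Tuple[str, str], float]


def _dfs_postorder(start, adj, seen, out):
    """Iterative DFS that appends each node to `out` once it is finished."""
    seen.add(start)
    stack = [(start, iter(adj[start]))]
    while stack:
        node, neighbours = stack[-1]
        for nxt in neighbours:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, iter(adj[nxt])))
                break
        else:  # every neighbour already explored
            stack.pop()
            out.append(node)


def components(nodes: List[str], gains: Gains) -> List[Set[str]]:
    """Strongly connected components (Kosaraju's two-pass algorithm)."""
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for src, dst in gains:
        fwd[src].append(dst)
        rev[dst].append(src)
    order, seen = [], set()
    for n in nodes:
        if n not in seen:
            _dfs_postorder(n, fwd, seen, order)
    comps, seen = [], set()
    for n in reversed(order):
        if n not in seen:
            comp = []
            _dfs_postorder(n, rev, seen, comp)
            comps.append(set(comp))
    return comps


def growth_rate(comp: Set[str], gains: Gains, steps: int = 200) -> float:
    """Estimate the per-step error growth inside one component."""
    err = {n: 1.0 / len(comp) for n in comp}
    log_growth = 0.0
    for _ in range(steps):
        nxt = {n: 0.0 for n in comp}
        for (src, dst), g in gains.items():
            if src in comp and dst in comp:
                nxt[dst] += g * err[src]
        norm = sum(nxt.values())
        if norm == 0.0:
            return 0.0  # no internal feedback loop: error dies out
        log_growth += math.log(norm)
        err = {n: v / norm for n, v in nxt.items()}
    return math.exp(log_growth / steps)


def amplifying_subgraph(nodes: List[str], gains: Gains) -> Set[str]:
    """Nodes in components whose internal error growth exceeds 1."""
    region: Set[str] = set()
    for comp in components(nodes, gains):
        if growth_rate(comp, gains) > 1.0:
            region |= comp
    return region


def symptom_scan(nodes, gains, seed, steps, k) -> Set[str]:
    """Baseline: inject error at `seed`, propagate it, and return the
    k nodes with the largest accumulated (visible) error."""
    err = {n: 0.0 for n in nodes}
    err[seed] = 1.0
    total = dict(err)
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for (src, dst), g in gains.items():
            nxt[dst] += g * err[src]
        err = nxt
        for n in nodes:
            total[n] += err[n]
    return set(sorted(nodes, key=total.get, reverse=True)[:k])


# Toy planning graph: a <-> b is a feedback loop; c and d sit downstream.
NODES = ["a", "b", "c", "d"]
GAINS = {("a", "b"): 1.5, ("b", "a"): 0.9, ("b", "c"): 2.0, ("c", "d"): 3.0}
```

On this toy graph, `symptom_scan(NODES, GAINS, seed="a", steps=10, k=2)` selects the downstream nodes `c` and `d`, where the accumulated error is largest, while `amplifying_subgraph(NODES, GAINS)` selects the loop `a` and `b` that keeps feeding error back into the graph. Under these assumptions, only the second region would be handed to the LLM for repair.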
How does WM-SAR compare with engineering LLM correctors?
The authors report that WM-SAR substantially outperforms engineering correctors under realistic token budgets, achieves near-whole-graph stabilization with a compact region, and gives the LLM a cleaner repair target. They implemented strong engineering LLM correctors as baselines and found that those methods can help, especially when given very large contexts, but are less effective when context is limited.
The paper highlights three practical limits of full-graph replay: it consumes large context budgets, it exposes the LLM to many irrelevant symptoms, and it can degrade long-context retrieval. Engineering correctors try to mitigate this by scanning and repairing local regions, but the authors argue that visible symptoms do not always identify the causal amplifiers. WM-SAR aims to fix the amplifier rather than treat surface symptoms, and the authors report that their graph simulations and LLM repair trials support this approach.
Why it matters
Agent planning is moving toward persistent workflows that may span thousands or tens of thousands of steps, where failures will occur inside large planning graphs rather than as isolated predictions. A repair method that requires replaying or re-evaluating the entire graph will strain context budgets and reduce the effective range of long-horizon agents. WM-SAR addresses the missing component the authors identify: a world-model corrector that repairs a failed planning graph in place, reducing the scope sent to the LLM while stabilizing the overall plan.
If WM-SAR’s claims hold across more workloads, practitioners building long-horizon planners or persistent agent systems could reduce token budgets and improve robustness by diagnosing and repairing causal amplifiers instead of performing broader, symptom-driven repairs.
What to watch
The initial upload is arXiv:2607.01767 [cs.AI], submitted 2 Jul 2026 (v1, 8,011 KB), and the paper is listed as under review. Watch for peer-reviewed publication, public release of the authors’ repair experiments or code attachments on the arXiv entry, and follow-up evaluations that test WM-SAR on real-world persistent workflows spanning thousands or tens of thousands of steps.
| Item | Method | Region sent to the LLM | Context budget | Applicability | Reported outcome |
|---|---|---|---|---|---|
| Engineering LLM correctors | Scan nodes and edges; choose a suspicious local region | Suspicious local region identified by visible symptoms | Requires very large contexts to be most effective | Can help, especially with very large contexts | Helpful but less effective under realistic token budgets |
| WM-SAR (World-Model Subgraph Amplification Repair) | Work backward from subgraph amplification to find causal subgraph | Causal subgraph that keeps re-amplifying error | Compact region fits realistic token budgets | Designed for long rollouts where full-graph replay is infeasible | Substantially outperforms engineering correctors; near-whole-graph stabilization |
Written by The Brieftide · Source: arXiv