Benchmarks & Evals · June 18, 2026 · 4 min read

WorldLines benchmark: Long-horizon stateful embodied agents

WorldLines provides temporally extended household traces for Memory QA and Embodied Task Planning, and evaluates ObsMem.

The Brieftide · June 18, 2026

TL;DR

01 WorldLines provides temporally extended household traces for Memory QA and Embodied Task Planning, and evaluates ObsMem.
02 WorldLines, a benchmark and dataset introduced in a paper submitted to arXiv on 17 Jun 2026, constructs temporally extended household traces to test long-horizon embodied assistance.
03 The paper, by Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, and Ying-Cong Chen (arXiv:2606.18847), is 27 pages long with 18 figures.

WorldLines, a benchmark and dataset introduced in a paper submitted to arXiv on 17 Jun 2026, constructs temporally extended household traces to test long-horizon embodied assistance. The paper, by Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, and Ying-Cong Chen (arXiv:2606.18847), is 27 pages long with 18 figures.

What is WorldLines and what does it test?

WorldLines is a project-driven benchmark that creates temporally extended household traces containing dialogues, actions, execution feedback, and object and device state changes. The traces are converted into evidence-linked samples for two evaluation tasks: Memory QA and Embodied Task Planning, enabling evaluation of memory use in dynamic, long-horizon domestic settings.
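To make the trace-to-sample idea concrete, here is a minimal sketch of how one trace and one evidence-linked sample might be represented. The class names, field names, and example events are illustrative assumptions, not the dataset's actual schema, which this brief does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical event kinds, mirroring the four trace ingredients named above.
EVENT_KINDS = ("dialogue", "action", "feedback", "state_change")


@dataclass
class TraceEvent:
    step: int                        # position in the household trace
    kind: str                        # one of EVENT_KINDS
    content: str                     # utterance, action call, or feedback text
    entity: Optional[str] = None     # object or device involved, if any
    new_state: Optional[str] = None  # set only for state_change events


@dataclass
class EvidenceLinkedSample:
    task: str                        # "memory_qa" or "task_planning"
    prompt: str
    answer: str
    evidence_steps: List[int] = field(default_factory=list)


def evidence_for(sample: EvidenceLinkedSample,
                 trace: List[TraceEvent]) -> List[TraceEvent]:
    """Return the trace events that a sample cites as its evidence."""
    wanted = set(sample.evidence_steps)
    return [event for event in trace if event.step in wanted]


# A made-up four-step trace, for illustration only.
trace = [
    TraceEvent(0, "dialogue", "User: please put the mug in the dishwasher."),
    TraceEvent(1, "action", "place(mug, dishwasher)", entity="mug"),
    TraceEvent(2, "feedback", "success"),
    TraceEvent(3, "state_change", "mug moved", entity="mug",
               new_state="in dishwasher"),
]

sample = EvidenceLinkedSample(
    task="memory_qa",
    prompt="Where is the mug?",
    answer="in dishwasher",
    evidence_steps=[1, 3],
)

cited = evidence_for(sample, trace)
```

Linking each sample to specific trace steps lets an evaluator check not only whether an answer is right but also whether the supporting events were available to the system.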

The dataset emphasizes stateful, time-extended interaction: each trace captures not just single tasks but sequences where prior interactions change world states and future decisions must account for history. That framing contrasts with existing long-term memory benchmarks that largely target language-centric retrieval and QA, and with embodied benchmarks that focus on short-horizon task execution without testing long-term memory use in dynamic environments.

How does ObsMem work and what role does it play?

ObsMem is presented as an "observer-grounded memory framework" that maintains visibility-aware memories and action-native state trails to support state-aware decisions. In practice ObsMem tracks what an observer (the agent) has seen and creates state trails tied to actions, aiming to supply the kind of persistent, visibility-conditioned memory needed for embodied planning.
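The description above can be illustrated with a toy memory that stores a state change only when the observer could see it, and keeps a per-entity trail tied to the action that caused each change. This is a sketch under stated assumptions, not ObsMem's implementation: `ObserverMemory`, `TrailEntry`, and the example events are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrailEntry:
    step: int
    action: str   # the action that produced this state
    state: str    # the resulting state of the entity


@dataclass
class ObserverMemory:
    # One state trail per object or device, in the order events were seen.
    trails: Dict[str, List[TrailEntry]] = field(default_factory=dict)

    def record(self, step: int, action: str, entity: str,
               state: str, visible: bool) -> None:
        """Store a state change only if the observer witnessed it."""
        if not visible:
            return
        self.trails.setdefault(entity, []).append(
            TrailEntry(step, action, state))

    def believed_state(self, entity: str) -> Optional[TrailEntry]:
        """Latest state the observer knows about, or None if never seen."""
        trail = self.trails.get(entity)
        return trail[-1] if trail else None


memory = ObserverMemory()
memory.record(1, "agent: place(keys, hallway_shelf)", "keys",
              "on hallway shelf", visible=True)
# The user moves the keys while the agent is in another room.
memory.record(5, "user: move(keys, bedroom)", "keys",
              "in bedroom", visible=False)
memory.record(7, "agent: turn_on(heater)", "heater", "on", visible=True)

belief = memory.believed_state("keys")
```

Here the agent's belief about the keys dates from step 1 even though the world changed at step 5. Keeping the step and action alongside each state lets a planner judge how old a belief is before acting on it.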

The paper positions ObsMem as a reference architecture for this long-horizon, stateful setting. Experiments using WorldLines evaluate how well systems translate long-term memory into embodied plans, with ObsMem serving as a stronger baseline architecture against which remaining gaps are measured.

What did the experiments find?

Experiments with WorldLines expose persistent challenges in three areas: partial observability, overwritten world states, and translating long-term memory into embodied plans. The paper reports that these issues continue to impede robust household assistance over extended interactions, and that while ObsMem offers a stronger reference architecture, the challenges are not fully solved.

These findings underscore that memory in embodied agents is more than retrieval: agents must reason about what they have seen, which states have been changed by actions, and when earlier states have been superseded. WorldLines’ evidence-linked samples are designed to surface these failure modes during evaluation.
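The overwritten-state failure mode is easy to reproduce in miniature. The sketch below contrasts a naive lookup that returns the first matching record with one that resolves the state valid at a given step. The function names and the example history are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# (step, entity, new_state); a hypothetical log of state changes.
StateChange = Tuple[int, str, str]


def first_mention(history: List[StateChange], entity: str) -> Optional[str]:
    """Naive retrieval: return the first recorded state for the entity."""
    for _, name, state in history:
        if name == entity:
            return state
    return None


def state_as_of(history: List[StateChange], entity: str,
                step: int) -> Optional[str]:
    """Return the most recent state recorded at or before `step`."""
    current = None
    for when, name, state in sorted(history):
        if when > step:
            break
        if name == entity:
            current = state  # later changes overwrite earlier ones
    return current


history = [
    (2, "thermostat", "20C"),
    (4, "window", "open"),
    (9, "thermostat", "23C"),
]

stale = first_mention(history, "thermostat")      # "20C", superseded at step 9
current = state_as_of(history, "thermostat", 10)  # "23C"
earlier = state_as_of(history, "thermostat", 5)   # "20C", still valid at step 5
```

Both functions retrieve a true record. Only the second accounts for when that record stopped being valid, which is the distinction between retrieval and state-aware memory drawn above.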

Why it matters

WorldLines shifts evaluation from isolated question answering or short tasks to temporally extended, stateful household scenarios, forcing memory systems to operate under partial observability and evolving device and object states. That change in evaluation focus matters because practical household assistance depends on handling sequences of interactions and remembering which world states remain valid over time. ObsMem’s visibility-aware, action-native trails offer a concrete architectural direction for researchers to build on.

What to watch

Look for follow-up work that applies WorldLines traces to additional models and that measures whether architectures patterned on ObsMem reduce failures from partial observability and overwritten states. Adoption of WorldLines for Memory QA and Embodied Task Planning benchmarks will be the clearest signal that the community is treating long-horizon, stateful evaluation as central.

References: arXiv:2606.18847, submitted 17 Jun 2026; 27 pages, 18 figures. Authors: Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, and Ying-Cong Chen.

WorldLines vs ObsMem: feature comparison

Item	WorldLines	ObsMem
Temporal traces	Yes, temporally extended household traces	Uses traces as inputs for memory-aware decisions
Included modalities	Dialogues, actions, execution feedback, object and device state changes	Visibility-aware memories and action-native state trails
Evaluation tasks	Memory QA; Embodied Task Planning	Provides memory architecture for these tasks
Primary research focus	Benchmarking long-horizon stateful embodied assistance	Reference architecture for state-aware decisions
Challenges highlighted	Partial observability; overwritten world states; translating long-term memory into plans	Aims to mitigate these via visibility and action-native state trails

Written by The Brieftide · Source: arXiv
