AI Safety · June 24, 2026 · 4 min read

SAFARI fault attribution: 20% gains on Who&When benchmark

SAFARI replaces linear context loading with a tool-augmented diagnostic loop and a persistent Short-Term Memory to scale long-horizon agentic fault attribution.

The Brieftide · June 24, 2026

TL;DR

01 SAFARI replaces linear context loading with a tool-augmented diagnostic loop and a persistent Short-Term Memory to scale long-horizon agentic fault attribution.
02 The active diagnostic loop decouples fault attribution accuracy from an LLM's context window.
03 The framework treats fault attribution as an active investigation rather than a one-shot summarization.

SAFARI replaces linear context loading with an active, tool-augmented diagnostic loop that decouples fault attribution accuracy from an LLM's context window. The framework equips models with a toolbox for reading and searching trajectory segments plus a persistent Short-Term Memory ("STM") for cross-turn reasoning, allowing diagnosis of faults whose traces extend beyond native context limits.

What is SAFARI and how does it work?

SAFARI is a diagnostic framework that swaps the usual approach of loading entire execution trajectories into a single context for an interactive loop that uses specialized tools and a persistent memory. Concretely, SAFARI gives an LLM a toolbox to read and search trajectory segments alongside a Short-Term Memory for cross-turn reasoning, replacing linear context loading and avoiding attention dilution when traces exceed context limits.

The framework treats fault attribution as an active investigation rather than a one-shot summarization. The toolbox accesses discrete trajectory segments so the model can fetch and inspect targeted slices of long multi-step, multi-agent executions. The STM preserves intermediate findings across turns so cross-turn reasoning does not depend on the architecture's native context window.
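The loop described above can be pictured with a small Python sketch. It is illustrative only and is not the authors' implementation: the names `Step`, `Toolbox`, `ShortTermMemory`, and `investigate` are hypothetical, and a fixed keyword policy stands in for the LLM that would choose each tool call in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int   # position of the step in the trajectory
    agent: str   # which agent produced the step
    text: str    # logged content of the step


@dataclass
class ShortTermMemory:
    """Findings that persist across turns (hypothetical structure)."""
    notes: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)


class Toolbox:
    """Hypothetical read/search tools over a long trajectory."""

    def __init__(self, trajectory: list[Step]):
        self.trajectory = trajectory

    def read(self, start: int, end: int) -> list[Step]:
        # Return only the requested slice, never the whole trace.
        return self.trajectory[start:end]

    def search(self, keyword: str) -> list[int]:
        # Return indices of steps whose text mentions the keyword.
        needle = keyword.lower()
        return [s.index for s in self.trajectory if needle in s.text.lower()]


def investigate(toolbox: Toolbox, memory: ShortTermMemory,
                keyword: str, window: int = 1):
    """Toy investigation: search, read around each hit, record findings.

    Returns (agent, step index) of the earliest matching step as a
    stand-in for an attribution verdict, or None if nothing matches.
    """
    hits = toolbox.search(keyword)
    for hit in hits:
        start = max(0, hit - window)
        for step in toolbox.read(start, hit + 1):
            memory.remember(f"step {step.index} ({step.agent}): {step.text}")
    if not hits:
        return None
    first = toolbox.trajectory[hits[0]]
    return (first.agent, first.index)


trajectory = [
    Step(0, "planner", "Split the task into subtasks"),
    Step(1, "coder", "Wrote parse.py"),
    Step(2, "coder", "Ran parse.py on input.csv"),
    Step(3, "coder", "Error: input.csv not found"),
    Step(4, "reviewer", "Marked the task as complete"),
]
memory = ShortTermMemory()
verdict = investigate(Toolbox(trajectory), memory, "error")
print(verdict)  # ('coder', 3)
```

In this toy run the investigator only ever reads the two steps around the hit, and the notes it keeps in memory are what a later turn would reason over instead of the full trace.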

How does SAFARI perform on benchmarks?

SAFARI outperforms prior methods by 20% on the Who&When dataset within a 1M token budget, and by 19% on the TRAIL GAIA subset within a 25K token budget. The authors also report that SAFARI maintains a precision of 0.58 when the target fault lies 5x beyond the model's native context window, a scenario where traditional evaluators fail entirely.

Those figures come from experiments reported in the paper. The 20% and 19% improvements are measured against state-of-the-art results under the stated token budgets. The maintained 0.58 precision highlights SAFARI's robustness in cases where execution traces exceed the model's native context by multiple factors.

How does this compare with current practices?

Current diagnostic approaches typically load the full trajectory into an LLM's context window. The paper says that strategy suffers from attention dilution and fails when agentic traces surpass context limits. SAFARI's tool-augmented loop removes the need for a single monolithic context by enabling targeted reads and searches over segments, while STM carries forward cross-turn inferences.
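The contrast between loading a full trace and reading it in bounded segments can be made concrete with some token accounting. The sketch below is an assumption-laden illustration and is not taken from the paper: `count_tokens` is a crude whitespace counter standing in for a real tokenizer, and `fits_in_context` and `segment` are hypothetical helpers.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fits_in_context(steps: list[str], context_limit: int) -> bool:
    """Linear loading: the whole trace must fit in a single context."""
    return sum(count_tokens(s) for s in steps) <= context_limit


def segment(steps: list[str], segment_limit: int) -> list[list[str]]:
    """Greedily pack consecutive steps into segments of at most segment_limit tokens.

    A single step larger than segment_limit still gets its own segment.
    """
    segments, current, used = [], [], 0
    for step in steps:
        cost = count_tokens(step)
        if current and used + cost > segment_limit:
            segments.append(current)
            current, used = [], 0
        current.append(step)
        used += cost
    if current:
        segments.append(current)
    return segments


# Toy trace: 10 steps of 4 tokens each, i.e. 40 tokens against an 8-token window.
steps = ["tool call returned ok"] * 10
print(fits_in_context(steps, context_limit=8))  # False
print(len(segment(steps, segment_limit=8)))     # 5
```

The full trace is five times larger than the toy window and cannot be loaded at once, yet every individual segment fits, which is the property that targeted reads rely on.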

That change in architecture is what allows SAFARI to keep diagnostic accuracy independent of the model's context length. The authors frame the approach as scaling long-horizon agentic fault attribution rather than expanding raw context size.

Why it matters

Long-running multi-step and multi-agent tasks produce execution traces that easily exceed even the largest model contexts. A diagnostic method whose accuracy depends on fitting entire traces into one context will degrade as horizons grow. SAFARI demonstrates a practical alternative: instrument the reasoning process with retrieval-style tools and persistent memory. If the reported 20% and 19% gains hold in other settings, practitioners will have a way to diagnose failures in longer-horizon agents without relying on ever-larger context windows.

That shift affects who can audit or debug agentic systems. Teams constrained by model context limits can adopt active investigation tooling and memory to preserve diagnostic performance across longer traces.

What to watch

The paper was submitted on 23 June 2026 and was published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026. The next concrete signal will be whether independent evaluations reproduce the 20% and 19% improvements across other datasets and token budgets, and whether practitioners adopt tool-augmented loops and STM for operational agent debugging.

If subsequent work shows the same precision advantages when faults sit many times beyond native context windows, SAFARI's approach will likely influence fault-attribution toolchains for long-horizon agent deployments.

References: SAFARI: Scaling long-horizon Agentic Fault AttRibution via active Investigation, Chenyang Zhu et al., arXiv submission 23 June 2026, published at AIWILD (ICML 2026).

SAFARI benchmark results and stress test

Item	Benchmark or condition	Setting	Result
Who&When improvement over SOTA	Who&When	1M token budget	20%
TRAIL GAIA subset improvement	TRAIL GAIA subset	25K token budget	19%
Precision when fault 5x beyond context	Fault 5x beyond native context	Stress condition	0.58 precision (traditional evaluators fail)

Written by The Brieftide · Source: arXiv
