Benchmarks & Evals · 4 min read

SafeClawBench: benchmark separating semantic, audit, sandbox harm

A 600-task staged benchmark measures semantic acceptance, audit-visible harm evidence, and sandbox-observed harm.

The Brieftide

TL;DR

  • 01 A 600-task staged benchmark measures semantic acceptance, audit-visible harm evidence, and sandbox-observed tool or state harm.
  • 02 The paper separates three failure endpoints: semantic acceptance, audit-visible harm evidence, and sandbox-observed tool or state harm, and publishes the open-source dataset at the provided URL.
  • 03 SafeClawBench measures three distinct endpoints so researchers can tell whether a model merely agreed with an attacker or actually produced observable harm.

SafeClawBench, submitted 16 June 2026 by Yuchuan Tian and seven coauthors, is a staged security benchmark for tool-using language-model agents that runs 600 controlled adversarial tasks across six attack families. The paper separates three failure endpoints: semantic acceptance, audit-visible harm evidence, and sandbox-observed tool or state harm, and publishes the open-source dataset at the provided URL.

What does SafeClawBench measure?

SafeClawBench measures three distinct endpoints so researchers can tell whether a model merely agreed with an attacker or actually produced observable harm. The benchmark reports "semantic attack acceptance", audit-visible harm evidence, and sandbox-observed tool/state harm, across 600 tasks and six attack families: direct prompt injection, indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference.

The benchmark design intentionally stages attacks rather than collapsing outcomes into a single success metric. That staging exposes whether an agent merely goes along with the attack in its text output, produces harm evidence that an auditor could detect, or drives real side effects when its actions are executed in a sandboxed environment.
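To make the staging concrete, here is a minimal sketch of how one task result could be recorded with the three endpoints kept separate. The class and field names (`StagedOutcome`, `semantic_acceptance`, and so on) are hypothetical and chosen for illustration; they are not taken from the paper or its dataset schema. Only the six attack-family names come from the article.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class AttackFamily(Enum):
    # The six attack families named in the article.
    DIRECT_PROMPT_INJECTION = "direct prompt injection"
    INDIRECT_PROMPT_INJECTION = "indirect prompt injection"
    TOOL_RETURN_INJECTION = "tool-return injection"
    MEMORY_POISONING = "memory poisoning"
    MEMORY_EXTRACTION = "memory extraction"
    AMBIGUITY_DRIVEN_UNSAFE_INFERENCE = "ambiguity-driven unsafe inference"


@dataclass(frozen=True)
class StagedOutcome:
    """Hypothetical per-task record; field names are illustrative only."""

    family: AttackFamily
    semantic_acceptance: bool  # the agent's text went along with the attack
    audit_visible_harm: bool   # harm evidence an auditor could see
    sandbox_harm: bool         # tool or state harm observed in the sandbox

    def stages_failed(self) -> list[str]:
        # Report each endpoint separately instead of one success flag.
        names = ["semantic_acceptance", "audit_visible_harm", "sandbox_harm"]
        return [name for name in names if getattr(self, name)]


# A row that looks safe in text but still causes sandbox-observed harm.
outcome = StagedOutcome(
    family=AttackFamily.TOOL_RETURN_INJECTION,
    semantic_acceptance=False,
    audit_visible_harm=False,
    sandbox_harm=True,
)
print(outcome.stages_failed())
```

Keeping three booleans rather than one "attack succeeded" flag is what lets a row like the one above be counted as a sandbox harm even though it would pass a text-only check.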

How did agents perform under the benchmark?

Across five agent endpoints evaluated under four prompt-level policies, semantic failure rates range from 9.0% to 44.2%, showing large differences in how readily models accept adversarial instructions in text. The authors ran a matched 12,000-row analysis and found that 291 of 347 observed sandbox harms (about 84%) occurred in rows that passed the semantic check, demonstrating that passing a semantic test does not guarantee the absence of executable harm.
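The headline ratio can be checked with a few lines of arithmetic. This sketch uses only the two counts quoted above; the variable names are illustrative, and the split into "semantic-pass" and "semantic-fail" rows assumes every observed sandbox harm falls into exactly one of those two groups.

```python
# Counts quoted in the article for the matched analysis.
SANDBOX_HARMS = 347
HARMS_IN_SEMANTIC_PASS_ROWS = 291

# Share of sandbox harms that a semantic-only check would have missed.
share_missed = HARMS_IN_SEMANTIC_PASS_ROWS / SANDBOX_HARMS

# Remaining harms, assumed here to sit in rows that failed the semantic check.
harms_in_semantic_fail_rows = SANDBOX_HARMS - HARMS_IN_SEMANTIC_PASS_ROWS

print(f"{share_missed:.1%}")          # 83.9%
print(harms_in_semantic_fail_rows)    # 56
```

In other words, roughly five out of six observed sandbox harms came from rows where the text-level check reported no problem.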

The paper reports that audit-visible harm evidence is narrower than semantic failure: not every semantic acceptance produces evidence an auditor would see, and passing the semantic check does not always prevent damage when actions execute in the sandbox. Prompt policies change outcomes, but the effect is model- and protocol-dependent. SafeClawBench therefore evaluates both the model responses and the observable downstream effects under an executable protocol.

The evaluation matrix includes five distinct agent endpoints and four prompt-level policies, allowing the benchmark to compare model behavior under differing defenses and operational setups. The open-source dataset and the authors’ experimental setup support reproducible comparisons across agent models and prompt-policy conditions.
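As a rough illustration of that matrix, the sketch below enumerates every agent, policy, and task combination. The agent and policy names are placeholders, since the article does not list them, and the assumption that the 12,000-row analysis is the full cross product of 600 tasks, five agents, and four policies is an inference from the numbers rather than something the article states.

```python
from itertools import product

# Placeholder identifiers; the article does not name the agents or policies.
agents = [f"agent_{i}" for i in range(1, 6)]       # five agent endpoints
policies = [f"policy_{j}" for j in range(1, 5)]    # four prompt-level policies
task_ids = range(600)                              # 600 controlled tasks

# Each (agent, policy) pair is one cell of the evaluation matrix.
cells = list(product(agents, policies))

# One row per (agent, policy, task) combination.
rows = [(agent, policy, task) for (agent, policy) in cells for task in task_ids]

print(len(cells))  # 20
print(len(rows))   # 12000
```

Under that assumption, each of the 20 cells holds the same 600 tasks, which is what makes a matched comparison across agents and policies possible.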

Why it matters

Security evaluations that collapse attacker success into a single rate hide where failures actually occur. By separating semantic acceptance from audit evidence and sandbox-observed state changes, SafeClawBench reveals a crucial gap: a model can appear safe in text while still causing real effects when its outputs are executed. That split matters for organizations that rely on text-only audits or static compliance checks, because such checks may not catch harmful state changes that occur downstream when tools or persistent memory are modified.

The paper provides concrete numbers that illustrate this risk: a semantic-failure range from 9.0% to 44.2% across models, and 291 of 347 sandbox harms appearing in rows that passed a semantic check in the matched analysis. Those figures show both the scale of variation between agents and the limitation of treating text compliance as a proxy for system safety.

What to watch

Watch whether model developers and operators adopt staged benchmarks like SafeClawBench that tie textual outputs to executable effects, and whether future prompt-policy research narrows the gap between semantic checks and sandbox safety. The dataset and reproducible protocol published by the authors will make it straightforward to compare new agent designs, prompt defenses, and sandbox protocols against these concrete failure modes.

Authors and document details: the paper lists Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, and Yu Wang as authors; the submission date is 16 June 2026, and the arXiv entry notes 32 pages and 5 figures. The open-source dataset is available at the URL included in the paper.

Written by The Brieftide · Source: arXiv
