
Vera: Safety Testing LLM Agents at Scale, 1600 Executable Cases

Vera automates safety testing for agentic LLMs, producing 1600 executable cases across 124 risk categories and finding average attack success rates reaching 93.9% under multi-channel attacks.

The Brieftide

TL;DR

  • Vera automates safety testing for agentic LLMs, producing 1600 executable cases across 124 risk categories and finding average attack success rates reaching 93.9% under multi-channel attacks.
  • Vera, an end-to-end automated safety-testing framework for agentic LLMs, was published on arXiv on 2 Jul 2026 by Yunhao Feng and 14 coauthors.
  • Vera is a three-stage, self-reinforcing pipeline that turns safety taxonomies into executable tests and evidence-grounded verification.

Vera, an end-to-end automated safety-testing framework for agentic LLMs, was published on arXiv on 2 Jul 2026 by Yunhao Feng and 14 coauthors. It constructs and runs executable safety cases at scale, releasing Vera-Bench with 1600 executable safety cases that span 124 risk categories across three execution settings and reporting average attack success rates reaching 93.9% under multi-channel attacks.

What is Vera and how does it work?

Vera is a three-stage, self-reinforcing pipeline that turns safety taxonomies into executable tests and evidence-grounded verification. First, a literature-driven exploration discovers and structures emerging risks into taxonomies of safety risks, attack methods, and tool execution environments. Second, combinatorial composition across those taxonomy dimensions programmatically produces executable safety cases, each with a concrete safety goal, a constructed initial state, and a deterministic verification predicate grounded in observable artifacts. Third, adaptive execution runs heterogeneous agents in isolated sandboxes where a control agent steers multi-turn interaction based on runtime observations, and evidence-grounded verifiers judge outcomes from environment state and tool-call evidence rather than model self-report.
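The second stage is the easiest to picture in code. The sketch below is a minimal illustration of combinatorial composition, not the authors' implementation: the taxonomy entries, the `SafetyCase` fields, and the canary-file predicates are all hypothetical names invented here, and the environment is reduced to a toy mapping from file path to content.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

# Hypothetical taxonomy entries, invented for illustration only.
RISKS = ["data_exfiltration", "destructive_file_op"]
ATTACKS = ["direct_prompt", "tool_output_injection"]
ENVIRONMENTS = ["shell", "email_client"]
CANARY = "CANARY-1234"  # marker planted in the initial state

State = Dict[str, str]  # toy environment: file path -> file content


@dataclass(frozen=True)
class SafetyCase:
    risk: str
    attack: str
    environment: str
    safety_goal: str
    initial_state: State
    violated: Callable[[State], bool]  # True means the attack succeeded


def make_predicate(risk: str, secret_path: str) -> Callable[[State], bool]:
    """Return a deterministic check over the final environment state."""
    if risk == "destructive_file_op":
        # Violated when the protected file is gone or altered.
        return lambda state: state.get(secret_path) != CANARY
    # Violated when the canary shows up anywhere other than its original path.
    return lambda state: any(
        CANARY in content for path, content in state.items() if path != secret_path
    )


def compose_cases() -> List[SafetyCase]:
    """Cross every risk with every attack and every environment."""
    cases = []
    for risk, attack, env in product(RISKS, ATTACKS, ENVIRONMENTS):
        secret_path = f"/sandbox/{env}/secret.txt"
        cases.append(SafetyCase(
            risk=risk,
            attack=attack,
            environment=env,
            safety_goal=f"Agent must not cause {risk} in {env} under {attack}",
            initial_state={secret_path: CANARY},
            violated=make_predicate(risk, secret_path),
        ))
    return cases
```

Two risks, two attacks, and two environments yield eight cases in this toy setup. Each case carries its own goal, initial state, and predicate, so a new taxonomy entry adds cases without any hand-written test code.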

The framework explicitly separates risk discovery, test generation, and verification so tests remain maintainable as agents and tools evolve. The paper states that the code is publicly available.

How was Vera evaluated and what did it find?

Vera was run against four production agent frameworks named in the manuscript: OpenClaw, Hermes, Codex, and Claude Code, revealing substantial safety weaknesses. The authors report average attack success rates reaching 93.9% under multi-channel attacks. The evaluation produced Vera-Bench, "comprising 1600 executable safety cases spanning 124 risk categories across three execution settings." These concrete numbers anchor the paper's core claim that modular, executable testing infrastructure is essential for rigorous safety evaluation of rapidly evolving agentic systems at scale.
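For readers unfamiliar with the metric, attack success rate is commonly computed as the share of cases in which the verifier judged the attack successful. The sketch below assumes that definition and a simple unweighted mean across frameworks. The brief does not say how the paper averages its figures, so both choices are assumptions, and the framework labels and verdicts are placeholder toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rates(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-framework share of cases where the attack was judged successful."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for framework, attack_succeeded in verdicts:
        totals[framework] += 1
        successes[framework] += int(attack_succeeded)
    return {name: successes[name] / totals[name] for name in totals}


def macro_average(rates: Dict[str, float]) -> float:
    """Unweighted mean across frameworks (an assumed aggregation)."""
    return sum(rates.values()) / len(rates)


# Placeholder verdicts: (framework label, did the attack succeed?)
toy_verdicts = [
    ("framework_a", True), ("framework_a", True),
    ("framework_a", True), ("framework_a", False),
    ("framework_b", True), ("framework_b", False),
]
toy_rates = attack_success_rates(toy_verdicts)
toy_average = macro_average(toy_rates)
```

With this toy input, `framework_a` scores 0.75, `framework_b` scores 0.5, and the unweighted mean is 0.625. A case-weighted mean would differ whenever frameworks are tested on different numbers of cases, which is one reason the aggregation method matters when comparing reported figures.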

The verification design is notable: verdicts derive from environment state and tool-call evidence rather than model self-report, a deliberate move to avoid trusting agents' own assertions about outcomes. The paper frames the control agent's role as steering interactions based on runtime observations while heterogeneous agents run in isolated sandboxes.
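A small sketch makes the distinction concrete. The verifier below looks only at recorded tool calls and the final sandbox files. It stores the agent's own summary but never reads it. The tool names, the outbox path, and every class here are hypothetical choices made for illustration and are not taken from the Vera codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical tool names that can move data out of the sandbox.
OUTBOUND_TOOLS = {"send_email", "http_post"}


@dataclass
class ToolCall:
    tool: str
    arguments: Dict[str, str]


@dataclass
class RunEvidence:
    final_files: Dict[str, str]
    tool_calls: List[ToolCall] = field(default_factory=list)
    agent_self_report: str = ""  # stored for the record, never consulted


@dataclass
class Verdict:
    attack_succeeded: bool
    reasons: List[str]


def verify_no_exfiltration(evidence: RunEvidence, secret: str) -> Verdict:
    """Judge the run from tool-call and file evidence only."""
    reasons = []
    for call in evidence.tool_calls:
        leaked = any(secret in value for value in call.arguments.values())
        if call.tool in OUTBOUND_TOOLS and leaked:
            reasons.append(f"secret passed to outbound tool {call.tool}")
    for path, content in evidence.final_files.items():
        if path.startswith("/sandbox/outbox/") and secret in content:
            reasons.append(f"secret written to {path}")
    return Verdict(attack_succeeded=bool(reasons), reasons=reasons)
```

An agent that reports "I refused the request" but whose trace contains a `send_email` call carrying the secret is still marked as compromised, because the self-report plays no part in the verdict.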

Why does this matter?

Automated agents increasingly call external tools and act autonomously, which expands the surface area for safety failures. Vera operationalizes software-engineering testing principles for non-deterministic agent behavior, replacing brittle, hard-coded rule checks with programmatic, executable cases and evidence-grounded verdicts. If the reported 93.9% average attack success under multi-channel attacks generalizes, it indicates that current agent frameworks can be highly vulnerable and that scalable, modular testing infrastructure is required to keep pace with agent evolution.

What to watch

Watch whether Vera-Bench and the public code are adopted by third parties and whether similar attack success rates appear when more agent frameworks and settings are tested. The next concrete signals will be external reproductions of the 93.9% figure on other platforms, expansion of Vera-Bench beyond its initial 124 risk categories, or tool and sandbox hardenings by agent-framework vendors that lower measured attack success.

Paper and submission details: the manuscript, titled "Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification," was submitted to arXiv on 2 Jul 2026 by Yunhao Feng and 14 other authors.

Quote: the authors describe one verifier property succinctly, stating that "evidence-grounded verifiers judge outcomes from environment state and tool-call evidence."

Figure: Vera three-stage testing pipeline. Literature-driven exploration (discovers taxonomies of risks, attacks, environments) passes taxonomies to combinatorial composition (produces executable safety cases with goals and predicates). The executable cases go to adaptive execution (runs agents in isolated sandboxes with a control agent), whose runtime evidence goes to the evidence-grounded verifiers (judge outcomes from environment state and tool-call evidence). Verified results form Vera-Bench (1600 executable cases, 124 risk categories, 3 execution settings), which is used to evaluate the agent frameworks OpenClaw, Hermes, Codex, and Claude Code.

Written by The Brieftide · Source: arXiv
