AI Safety5 min read

OSGuard benchmark: safety tests for computer-use agents

A dual-granularity suite with action-level labels and risk-augmented executions to reveal unsafe shortcuts in agents.

The Brieftide

TL;DR

  • 01A dual-granularity suite with action-level labels and risk-augmented executions to reveal unsafe shortcuts in agents.
  • 02The suite aims to reveal cases where an agent achieves a nominal task goal by taking an unsafe shortcut rather than performing the task safely.
  • 03First, an action-level benchmark presents contextualized proposed actions that are labeled as allowed, unrelated, or unsafe.

OSGuard is a new benchmark suite for safety in computer-use agents, introduced by Mina Mohammadmirzaei and Jeffrey Flanigan in a paper submitted to arXiv on 13 Jun 2026 (arXiv:2606.15034). The suite aims to reveal cases where an agent achieves a nominal task goal by taking an unsafe shortcut rather than performing the task safely.

What OSGuard contains

OSGuard uses a dual-granularity design. First, an action-level benchmark presents contextualized proposed actions that are labeled as allowed, unrelated, or unsafe. Each label is assigned relative to the original user instruction and the current interface state. Second, a risk-augmented execution suite provides end-to-end evaluation using manually constructed OSWorld-derived task variants in which the original task remains achievable but the environment is modified to introduce latent hazards such as destructive overwrites.

The execution suite pairs each variant with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants. That pairing lets evaluators distinguish safe completions from unsafe completions that nevertheless satisfy the nominal task objective.

The paper frames the benchmark around a practical gap: computer-use agents are increasingly judged by whether they complete realistic desktop and web tasks, but task success alone can miss failures where an unsafe shortcut produces the same visible outcome.

Experimental findings

The authors report that current multimodal guardrails can perform well on isolated action judgments in the action-level benchmark. The paper also reports a contrasting result from the risk-augmented execution suite: these end-to-end, state-aware variants expose remaining gaps between local oversight and reliable full-task safety. In other words, good performance on individual action labels does not guarantee that guardrails will prevent unsafe task completions in modified environments that introduce latent hazards.

The dual-granularity design is presented as a diagnostic tool. The action-level benchmark tests whether models can recognize unsafe proposed actions in context. The risk-augmented execution suite tests whether the presence of such local assessments actually improves safety when the model is deployed as a guardrail during full tasks.

Why it matters

OSGuard reframes evaluation from a single task-success metric to a two-part question: can agents spot unsafe actions, and do those local judgments translate into safer end-to-end behavior when the environment changes? That distinction matters because real desktop and web environments frequently contain latent hazards that can transform a nominally successful sequence of actions into an unsafe outcome.

The benchmark provides a structured way to measure where safety mechanisms fail: at the recognition stage or at the integration stage when oversight must influence ongoing behavior. For researchers and developers building guardrails, the suite offers targeted diagnostics rather than a binary pass/fail on task completion.

What to watch

Watch for follow-up work that reports quantitative comparisons using OSGuard variants, specifically whether improvements on action-level labels yield measurable reductions in unsafe completions under the risk-augmented execution tests. Also watch for broader adoption of state-based safety invariants in evaluators for desktop and web task benchmarks.

The OSGuard paper and its dual-granularity design shift evaluation toward distinguishing safe completions from unsafe ones that nevertheless meet task objectives, making it easier to diagnose and close the gap between local oversight and end-to-end safety.

OSGuard: action-level benchmark vs risk-augmented execution suite
Item
Primary purposeEvaluate local guardrail decisionsEnd-to-end safety evaluation with environment variants
GranularityAction-levelTask-level / execution
Labels or evaluatorsContextualized proposed actions labeled allowed, unrelated, or unsafeAugmented evaluators that add explicit state-based safety invariants
Task objectiveJudged relative to original instruction and current interface stateOriginal task remains achievable while hazards are introduced
Hazards introducedFocus on unsafe proposed actions in contextLatent hazards such as destructive overwrites
What it measuresModel recognition of unsafe actionsAbility to distinguish safe completions from unsafe completions that satisfy the nominal objective
Advertisement

Written by The Brieftide · Source: arXiv

The Brieftide Daily · 06:00

Briefs like this one, in your inbox every morning.

 

FreeOne email a dayEvery claim sourcedUnsubscribe in one click

Continue reading

More in AI Safety
Advertisement