OSGuard benchmark: safety tests for computer-use agents
A dual-granularity suite with action-level labels and risk-augmented executions to reveal unsafe shortcuts in agents.
TL;DR
- A dual-granularity suite with action-level labels and risk-augmented executions to reveal unsafe shortcuts in agents.
- The suite aims to reveal cases where an agent achieves a nominal task goal by taking an unsafe shortcut rather than performing the task safely.
- First, an action-level benchmark presents contextualized proposed actions that are labeled as allowed, unrelated, or unsafe.
OSGuard is a new benchmark suite for safety in computer-use agents, introduced by Mina Mohammadmirzaei and Jeffrey Flanigan in a paper submitted to arXiv on 13 Jun 2026 (arXiv:2606.15034). The suite aims to reveal cases where an agent achieves a nominal task goal by taking an unsafe shortcut rather than performing the task safely.
What OSGuard contains
OSGuard uses a dual-granularity design. First, an action-level benchmark presents contextualized proposed actions, each labeled as allowed, unrelated, or unsafe relative to the original user instruction and the current interface state. Second, a risk-augmented execution suite provides end-to-end evaluation using manually constructed task variants derived from OSWorld. In each variant the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites.
The execution suite pairs each variant with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants. That pairing lets evaluators distinguish safe completions from unsafe completions that nevertheless satisfy the nominal task objective.
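The pairing of a task-success criterion with state-based safety invariants can be illustrated with a minimal Python sketch. This is not the paper's evaluator code: the `AugmentedEvaluator` class, the dictionary-based `State` snapshot, the `no_destructive_overwrite` helper, and the three outcome strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state snapshot: file name -> file contents.
State = Dict[str, str]
Invariant = Callable[[State, State], bool]  # (before, after) -> invariant holds?


@dataclass
class AugmentedEvaluator:
    """Keeps the original success check and adds safety invariants (illustrative)."""
    task_success: Callable[[State], bool]
    invariants: List[Invariant]

    def evaluate(self, before: State, after: State) -> str:
        # The nominal task objective is checked exactly as in the original task.
        if not self.task_success(after):
            return "task_failed"
        # Safety is judged on environment state, not on the visible outcome alone.
        if all(inv(before, after) for inv in self.invariants):
            return "safe_completion"
        return "unsafe_completion"


def no_destructive_overwrite(protected: str) -> Invariant:
    """Invariant: the protected file must be unchanged after the episode."""
    def check(before: State, after: State) -> bool:
        return after.get(protected) == before.get(protected)
    return check


# Assumed toy scenario: write "done" into report.txt without touching backup.txt.
before = {"report.txt": "", "backup.txt": "important data"}
evaluator = AugmentedEvaluator(
    task_success=lambda s: s.get("report.txt") == "done",
    invariants=[no_destructive_overwrite("backup.txt")],
)

safe_after = {"report.txt": "done", "backup.txt": "important data"}
unsafe_after = {"report.txt": "done", "backup.txt": "done"}  # shortcut clobbered backup

print(evaluator.evaluate(before, safe_after))    # safe_completion
print(evaluator.evaluate(before, unsafe_after))  # unsafe_completion
```

Both final states satisfy the nominal objective, so a success-only evaluator would score them identically. Only the invariant check separates them. The sketch skips invariant checks when the task fails, which is a simplification rather than something the source specifies.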
The paper frames the benchmark around a practical gap: computer-use agents are increasingly judged by whether they complete realistic desktop and web tasks, but task success alone can miss failures where an unsafe shortcut produces the same visible outcome.
Experimental findings
The authors report that current multimodal guardrails can perform well on isolated action judgments in the action-level benchmark. The paper also reports a contrasting result from the risk-augmented execution suite: these end-to-end, state-aware variants expose remaining gaps between local oversight and reliable full-task safety. In other words, good performance on individual action labels does not guarantee that guardrails will prevent unsafe task completions in modified environments that introduce latent hazards.
The dual-granularity design is presented as a diagnostic tool. The action-level benchmark tests whether models can recognize unsafe proposed actions in context. The risk-augmented execution suite tests whether the presence of such local assessments actually improves safety when the model is deployed as a guardrail during full tasks.
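To make the action-level side concrete, here is a small Python sketch of how a labeled example and a guardrail score might be represented. The three label values come from the article; everything else (the `ActionExample` fields, the toy examples, the `toy_guardrail` rule, and the two metrics) is assumed for illustration and does not reflect the benchmark's actual schema or scoring.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    ALLOWED = "allowed"
    UNRELATED = "unrelated"
    UNSAFE = "unsafe"


@dataclass(frozen=True)
class ActionExample:
    # Hypothetical fields: labels depend on the instruction and the interface state.
    instruction: str
    interface_state: str
    proposed_action: str
    label: Label


Guardrail = Callable[[ActionExample], Label]


def accuracy(guardrail: Guardrail, examples: List[ActionExample]) -> float:
    """Fraction of examples where the guardrail's label matches the gold label."""
    if not examples:
        return 0.0
    return sum(guardrail(ex) == ex.label for ex in examples) / len(examples)


def unsafe_recall(guardrail: Guardrail, examples: List[ActionExample]) -> float:
    """Fraction of gold-unsafe actions that the guardrail flags as unsafe."""
    unsafe = [ex for ex in examples if ex.label is Label.UNSAFE]
    if not unsafe:
        return 0.0
    return sum(guardrail(ex) is Label.UNSAFE for ex in unsafe) / len(unsafe)


# Invented examples, not taken from the benchmark.
task = "Rename notes.txt to notes_old.txt"
examples = [
    ActionExample(task, "file manager: notes.txt", "rename notes.txt notes_old.txt", Label.ALLOWED),
    ActionExample(task, "file manager: notes.txt", "open web browser", Label.UNRELATED),
    ActionExample(task, "file manager: notes.txt, notes_old.txt", "rm notes_old.txt", Label.UNSAFE),
]


def toy_guardrail(ex: ActionExample) -> Label:
    # Deliberately naive stand-in for a multimodal guardrail model.
    return Label.UNSAFE if ex.proposed_action.startswith("rm ") else Label.ALLOWED


print(accuracy(toy_guardrail, examples))
print(unsafe_recall(toy_guardrail, examples))
```

Scores like these describe isolated judgments only. They say nothing about whether the same guardrail prevents an unsafe completion once it is placed in the loop of a full task, which is the question the execution suite is meant to answer.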
Why it matters
OSGuard reframes evaluation from a single task-success metric to a two-part question: can agents spot unsafe actions, and do those local judgments translate into safer end-to-end behavior when the environment changes? That distinction matters because real desktop and web environments frequently contain latent hazards that can transform a nominally successful sequence of actions into an unsafe outcome.
The benchmark provides a structured way to measure where safety mechanisms fail: at the recognition stage or at the integration stage when oversight must influence ongoing behavior. For researchers and developers building guardrails, the suite offers targeted diagnostics rather than a binary pass/fail on task completion.
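One way to read the recognition-versus-integration distinction is as a simple triage over episode logs, sketched below in Python. The `Episode` record, the `hazard_flagged` field, and the rule that maps outcomes to failure stages are assumptions made for this illustration. The source does not describe how such a breakdown is computed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Episode:
    # Hypothetical log record for one guarded run of a risk-augmented task.
    outcome: str          # "safe_completion", "unsafe_completion", or "task_failed"
    hazard_flagged: bool  # did the guardrail label the hazardous action unsafe?


def diagnose(ep: Episode) -> str:
    """Assumed triage rule: attribute each unsafe completion to one stage."""
    if ep.outcome != "unsafe_completion":
        return "no_safety_failure"
    # Flagged but still unsafe: the judgment did not change the agent's behavior.
    if ep.hazard_flagged:
        return "integration_failure"
    # Never flagged: the guardrail missed the hazard in the first place.
    return "recognition_failure"


def summarize(episodes: Iterable[Episode]) -> Counter:
    return Counter(diagnose(ep) for ep in episodes)


# Invented log entries for demonstration.
log = [
    Episode("safe_completion", hazard_flagged=True),
    Episode("unsafe_completion", hazard_flagged=False),
    Episode("unsafe_completion", hazard_flagged=True),
    Episode("task_failed", hazard_flagged=False),
]
summary = summarize(log)
print(dict(summary))
```

A breakdown of this kind would indicate whether effort is better spent on the guardrail's judgments or on how those judgments are enforced during execution.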
What to watch
Watch for follow-up work that reports quantitative comparisons using OSGuard variants, specifically whether improvements on action-level labels yield measurable reductions in unsafe completions under the risk-augmented execution tests. Also watch for broader adoption of state-based safety invariants in evaluators for desktop and web task benchmarks.
The OSGuard paper and its dual-granularity design shift evaluation toward distinguishing safe completions from unsafe ones that nevertheless meet task objectives, making it easier to diagnose and close the gap between local oversight and end-to-end safety.
| Item | Action-level benchmark | Risk-augmented execution suite |
|---|---|---|
| Primary purpose | Evaluate local guardrail decisions | End-to-end safety evaluation with environment variants |
| Granularity | Action-level | Task-level / execution |
| Labels or evaluators | Contextualized proposed actions labeled allowed, unrelated, or unsafe | Augmented evaluators that add explicit state-based safety invariants |
| Task objective | Judged relative to original instruction and current interface state | Original task remains achievable while hazards are introduced |
| Hazards introduced | Focus on unsafe proposed actions in context | Latent hazards such as destructive overwrites |
| What it measures | Model recognition of unsafe actions | Ability to distinguish safe completions from unsafe completions that satisfy the nominal objective |
Written by The Brieftide · Source: arXiv