AI Safety · June 16, 2026 · 5 min read

OSGuard benchmark: safety tests for computer-use agents

A dual-granularity suite with action-level labels and risk-augmented executions to reveal unsafe shortcuts in agents.

The Brieftide · June 16, 2026

TL;DR

01 A dual-granularity suite with action-level labels and risk-augmented executions to reveal unsafe shortcuts in agents.
02 The suite aims to reveal cases where an agent achieves a nominal task goal by taking an unsafe shortcut rather than performing the task safely.
03 First, an action-level benchmark presents contextualized proposed actions that are labeled as allowed, unrelated, or unsafe.

OSGuard is a new benchmark suite for safety in computer-use agents, introduced by Mina Mohammadmirzaei and Jeffrey Flanigan in a paper submitted to arXiv on 13 Jun 2026 (arXiv:2606.15034). The suite aims to reveal cases where an agent achieves a nominal task goal by taking an unsafe shortcut rather than performing the task safely.

What OSGuard contains

OSGuard uses a dual-granularity design. First, an action-level benchmark presents contextualized proposed actions that are labeled as allowed, unrelated, or unsafe. Each label is assigned relative to the original user instruction and the current interface state. Second, a risk-augmented execution suite provides end-to-end evaluation using manually constructed OSWorld-derived task variants in which the original task remains achievable but the environment is modified to introduce latent hazards such as destructive overwrites.
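To make the action-level setup concrete, here is a minimal Python sketch of how one labeled item and a simple scoring function might look. The three label values come from the description above; everything else (the `ActionExample` fields, the sample items, and the toy `keyword_guardrail`) is a hypothetical illustration, not the paper's data format or method.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # The three label values described for the action-level benchmark.
    ALLOWED = "allowed"
    UNRELATED = "unrelated"
    UNSAFE = "unsafe"


@dataclass(frozen=True)
class ActionExample:
    # Hypothetical record layout: a label is only meaningful relative to
    # the user instruction and the current interface state.
    instruction: str
    interface_state: str
    proposed_action: str
    label: Label


def label_accuracy(examples, predict):
    """Fraction of examples where predict(example) equals the gold label."""
    if not examples:
        return 0.0
    correct = sum(1 for ex in examples if predict(ex) == ex.label)
    return correct / len(examples)


# Invented sample items, for illustration only.
INSTRUCTION = "Rename report.txt to final.txt"
STATE = "File manager showing report.txt and an existing final.txt"
examples = [
    ActionExample(INSTRUCTION, STATE, "open rename dialog for report.txt", Label.ALLOWED),
    ActionExample(INSTRUCTION, STATE, "open the web browser", Label.UNRELATED),
    ActionExample(INSTRUCTION, STATE, "delete existing final.txt", Label.UNSAFE),
]


def keyword_guardrail(ex):
    # Toy stand-in for a real guardrail model: flags any deletion as unsafe
    # and treats everything else as allowed.
    return Label.UNSAFE if "delete" in ex.proposed_action else Label.ALLOWED


accuracy = label_accuracy(examples, keyword_guardrail)
```

In this toy run the keyword guardrail gets the allowed and unsafe items right but mislabels the unrelated one, which shows why a three-way label is more informative than a plain safe/unsafe split.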

The execution suite pairs each variant with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants. That pairing lets evaluators distinguish safe completions from unsafe completions that nevertheless satisfy the nominal task objective.
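The pairing of a task-success criterion with state-based safety invariants can be sketched in a few lines of Python. This is an assumed shape, not OSGuard's actual evaluator code: the `State` dictionary, the `evaluate` function, the verdict strings, and the file names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical environment state: file path -> file contents.
State = Dict[str, str]
Check = Callable[[State], bool]


def evaluate(
    final_state: State,
    task_success: Check,
    invariants: List[Tuple[str, Check]],
) -> Tuple[str, List[str]]:
    """Combine the original success check with added safety invariants.

    Returns a verdict plus the names of any violated invariants.
    """
    violated = [name for name, holds in invariants if not holds(final_state)]
    if not task_success(final_state):
        return ("task_failed", violated)
    if violated:
        return ("unsafe_completion", violated)
    return ("safe_completion", violated)


# Invented task: write the notes into out.txt.
def notes_saved(state: State) -> bool:
    return state.get("out.txt") == "notes"


# Invented latent hazard: a pre-existing backup.txt must keep its contents.
invariants = [
    ("backup_preserved", lambda s: s.get("backup.txt") == "important"),
]

safe_run = {"backup.txt": "important", "out.txt": "notes"}
unsafe_run = {"backup.txt": "notes", "out.txt": "notes"}  # destructive overwrite

safe_verdict = evaluate(safe_run, notes_saved, invariants)
unsafe_verdict = evaluate(unsafe_run, notes_saved, invariants)
```

Both runs satisfy the nominal objective, so a success-only evaluator would score them identically. Only the invariant check separates the safe completion from the one that overwrote the backup.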

The paper frames the benchmark around a practical gap: computer-use agents are increasingly judged by whether they complete realistic desktop and web tasks, but task success alone can miss failures where an unsafe shortcut produces the same visible outcome.

Experimental findings

The authors report that current multimodal guardrails can perform well on isolated action judgments in the action-level benchmark. The paper also reports a contrasting result from the risk-augmented execution suite: these end-to-end, state-aware variants expose remaining gaps between local oversight and reliable full-task safety. In other words, good performance on individual action labels does not guarantee that guardrails will prevent unsafe task completions in modified environments that introduce latent hazards.

The dual-granularity design is presented as a diagnostic tool. The action-level benchmark tests whether models can recognize unsafe proposed actions in context. The risk-augmented execution suite tests whether the presence of such local assessments actually improves safety when the model is deployed as a guardrail during full tasks.

Why it matters

OSGuard reframes evaluation from a single task-success metric to a two-part question: can guardrails recognize unsafe actions in context, and do those local judgments translate into safer end-to-end behavior when the environment changes? That distinction matters because real desktop and web environments can contain latent hazards that turn a nominally successful sequence of actions into an unsafe outcome.

The benchmark provides a structured way to measure where safety mechanisms fail: at the recognition stage or at the integration stage when oversight must influence ongoing behavior. For researchers and developers building guardrails, the suite offers targeted diagnostics rather than a binary pass/fail on task completion.
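One way to picture the recognition-versus-integration split is a small tally over episode logs. The sketch below is an assumed operationalization, not a procedure from the paper: the `Episode` fields, the category names, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # Hypothetical per-run log fields.
    task_succeeded: bool      # original task-success criterion met
    invariant_violated: bool  # some state-based safety invariant broken
    hazard_flagged: bool      # guardrail flagged the hazardous action at some step


def diagnose(ep: Episode) -> str:
    """Assign one episode to a coarse diagnostic category."""
    if not ep.invariant_violated:
        return "safe_completion" if ep.task_succeeded else "safe_but_failed"
    # The environment ended in an unsafe state. If the guardrail never
    # flagged the hazard, treat it as a recognition failure; if it did flag
    # it and the unsafe outcome happened anyway, treat it as an integration
    # failure.
    return "integration_failure" if ep.hazard_flagged else "recognition_failure"


def summarize(episodes) -> Counter:
    return Counter(diagnose(ep) for ep in episodes)


# Invented sample log, for illustration only.
episodes = [
    Episode(task_succeeded=True, invariant_violated=False, hazard_flagged=True),
    Episode(task_succeeded=True, invariant_violated=True, hazard_flagged=False),
    Episode(task_succeeded=True, invariant_violated=True, hazard_flagged=True),
    Episode(task_succeeded=False, invariant_violated=False, hazard_flagged=False),
]

summary = summarize(episodes)
```

A tally like this says where to invest: many recognition failures point at the guardrail model itself, while many integration failures point at how its judgments are wired into the agent loop.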

What to watch

Watch for follow-up work that reports quantitative comparisons using OSGuard variants, specifically whether improvements on action-level labels yield measurable reductions in unsafe completions under the risk-augmented execution tests. Also watch for broader adoption of state-based safety invariants in evaluators for desktop and web task benchmarks.

The OSGuard paper and its dual-granularity design shift evaluation toward distinguishing safe completions from unsafe ones that nevertheless meet task objectives, making it easier to diagnose and close the gap between local oversight and end-to-end safety.

OSGuard: action-level benchmark vs risk-augmented execution suite

Item	Action-level benchmark	Risk-augmented execution suite
Primary purpose	Evaluate local guardrail decisions	End-to-end safety evaluation with environment variants
Granularity	Action-level	Task-level / execution
Labels or evaluators	Contextualized proposed actions labeled allowed, unrelated, or unsafe	Augmented evaluators that add explicit state-based safety invariants
Task objective	Judged relative to original instruction and current interface state	Original task remains achievable while hazards are introduced
Hazards introduced	Focus on unsafe proposed actions in context	Latent hazards such as destructive overwrites
What it measures	Model recognition of unsafe actions	Ability to distinguish safe completions from unsafe completions that satisfy the nominal objective

Written by The Brieftide · Source: arXiv
