GUI vs CLI benchmark: 440 tasks, GUI 59.1% vs CLI 48.2%
A matched 440-task execution benchmark by Xiao Zhou et al. finds the strongest GUI agent at a 59.1% full pass rate and the strongest original-skill CLI agent at 48.2%.
Xiao Zhou and co‑authors submitted a matched execution-layer benchmark on 22 Jun 2026 that compares screen-only GUI agents with skill-mediated CLI agents across 440 desktop tasks. The paper evaluates both modalities with identical goals, initial states, and final-state verifiers to isolate modality-specific execution bottlenecks.
What did the benchmark test?
The benchmark runs 440 desktop tasks across 18 applications and 12 workflow categories with identical goals, states, and final-state verifiers for both modalities. The dataset is designed so GUI agents are restricted to screen-native interactions and CLI agents to programmatic skill interfaces, removing confounds from differing tasks, initial conditions, verifiers, or allowed actions.
The matched design forces the two agent types to operate from the same starting points and targets. That lets the authors separate failures that arise from interaction modality from failures that arise because one side has different tools or permissions.
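The matched setup described above can be pictured as one task record shared by both modalities. The following is a minimal sketch, not the paper's harness: `MatchedTask`, `run_matched`, the dictionary-based `State`, and the two toy agents are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for a desktop state: file name -> contents.
State = Dict[str, str]
Agent = Callable[[str, State], State]


@dataclass(frozen=True)
class MatchedTask:
    goal: str
    initial_state: State
    verifier: Callable[[State], bool]  # inspects only the final state


def run_matched(task: MatchedTask, agents: Dict[str, Agent]) -> Dict[str, bool]:
    """Run every agent on the same goal, start state, and verifier."""
    results = {}
    for modality, agent in agents.items():
        start = dict(task.initial_state)  # fresh copy so runs cannot interfere
        final = agent(task.goal, start)
        results[modality] = task.verifier(final)
    return results


def toy_gui_agent(goal: str, state: State) -> State:
    # Pretend the agent renamed the file through on-screen interaction.
    state["report.txt"] = state.pop("draft.txt")
    return state


def toy_cli_agent(goal: str, state: State) -> State:
    # Pretend the skill set has no rename operation, so nothing changes.
    return state


task = MatchedTask(
    goal="Rename draft.txt to report.txt",
    initial_state={"draft.txt": "contents"},
    verifier=lambda s: "report.txt" in s and "draft.txt" not in s,
)

results = run_matched(task, {"gui": toy_gui_agent, "cli": toy_cli_agent})
print(results)
```

Because the goal, start state, and verifier are held fixed, any difference in `results` can only come from how each agent acted, which is the confound-removal argument the authors make.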
How did GUI and CLI perform?
The strongest GUI agent reached a 59.1% full pass rate, the strongest original-skill CLI agent reached a 48.2% full pass rate, and a verifier-guided skill augmentation raised CLI success to 69.3%. These three numbers are the paper's central quantitative findings.
The authors highlight that augmenting CLI agents with verifier-guided skills flips the ranking: verifier-guided CLI agents exceed the strongest GUI agent's full pass rate. The paper frames the raw 48.2% figure for original-skill CLI agents as not solely a model limitation, writing that "much of the CLI deficit comes from incomplete skill coverage rather than model capability alone."
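The gaps between the three reported rates follow from simple arithmetic. The sketch below computes them in percentage points and, assuming each rate is a share of all 440 tasks (an assumption, since the brief does not give raw counts), the approximate number of fully passed tasks each rate would imply. The dictionary keys and function names are hypothetical labels.

```python
TOTAL_TASKS = 440

# Full pass rates (%) as reported in the brief.
full_pass_rates = {
    "gui": 59.1,
    "cli_original": 48.2,
    "cli_verifier_guided": 69.3,
}


def gap_pp(a: str, b: str) -> float:
    """Difference between two reported rates, in percentage points."""
    return round(full_pass_rates[a] - full_pass_rates[b], 1)


def implied_passes(rate_percent: float) -> int:
    """Approximate task count, ASSUMING the rate is over all 440 tasks."""
    return round(TOTAL_TASKS * rate_percent / 100)


print("GUI over original CLI:", gap_pp("gui", "cli_original"), "pp")
print("Augmented CLI over GUI:", gap_pp("cli_verifier_guided", "gui"), "pp")
print("Augmented over original CLI:", gap_pp("cli_verifier_guided", "cli_original"), "pp")

for name, rate in full_pass_rates.items():
    print(name, "is roughly", implied_passes(rate), "of", TOTAL_TASKS, "tasks")
```

Under that assumption, the original-skill CLI agent trails the GUI agent by 10.9 points, and augmentation lifts the CLI agent 21.1 points to finish 10.2 points ahead of the GUI agent.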
Beyond the headline rates, the benchmark spans diverse applications and workflows so that long-horizon tasks and multi-step GUI interactions are part of the evaluation, stressing different failure modes for each modality.
Why does each modality fail differently?
GUI agents are limited by reliable grounded interaction over long-horizon workflows, while CLI agents are limited by the coverage and scalability of their skill interfaces. The paper positions these as distinct execution bottlenecks: GUI work depends on sustained, accurate manipulation of on-screen elements across many steps; CLI work depends on whether the available skill set can express the needed operations and whether that skill set scales to broad desktop functionality.
The verifier-guided improvement for CLI agents demonstrates that expanding or better coordinating skills can overcome a large portion of the CLI gap. Conversely, the GUI agents’ remaining failures point to the difficulty of robust, grounded control in extended GUI sequences despite having modality-native actions.
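One way to build intuition for the two bottlenecks is a toy model, which is not taken from the paper. It assumes that GUI steps succeed independently with a fixed per-step accuracy, and that a CLI task is solvable only if every operation it needs is present in the skill set. All function names, operation names, and numbers here are made up for illustration.

```python
from typing import Set


def gui_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step succeeds, ASSUMING independent steps."""
    return per_step_accuracy ** steps


def missing_skills(required_ops: Set[str], skills: Set[str]) -> Set[str]:
    """Operations the task needs that the skill set cannot express."""
    return required_ops - skills


def cli_can_express(required_ops: Set[str], skills: Set[str]) -> bool:
    """A task is expressible only if no required operation is missing."""
    return not missing_skills(required_ops, skills)


# GUI side: small per-step errors compound over a long workflow.
short_run = gui_success_probability(0.98, 5)
long_run = gui_success_probability(0.98, 30)
print(f"5 steps: {short_run:.2f}, 30 steps: {long_run:.2f}")

# CLI side: one missing skill blocks the task until coverage is extended.
required = {"open_file", "rename_file", "export_pdf"}
skills = {"open_file", "rename_file"}
print(cli_can_express(required, skills), missing_skills(required, skills))

augmented = skills | missing_skills(required, skills)
print(cli_can_express(required, augmented))
```

In this simplified picture the GUI agent degrades gradually as workflows get longer, while the CLI agent fails outright on a coverage gap and recovers as soon as the gap is filled. That is consistent with, though far simpler than, the paper's account of verifier-guided skill augmentation.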
What to watch
Look for follow-up work that expands skill coverage for CLI agents or that strengthens long-horizon grounding for GUI agents. The paper presents verifier-guided skill augmentation as a concrete lever: further gains from broader skill coverage or stronger verification mechanisms would indicate that the CLI bottleneck is primarily an engineering scope problem rather than a model capability ceiling.
Improvements in GUI interaction primitives and in methods for maintaining reliable state across multi-step workflows would show whether the GUI bottleneck can be closed without changing modality.
The paper, arXiv:2606.24551, supplies the matched 440-task benchmark, the 18-application and 12-category scope, and the three headline pass rates (59.1%, 48.2%, 69.3%) that define these next milestones.
| Item | Modality | Full pass rate (%) |
|---|---|---|
| Strongest GUI agent | GUI | 59.1 |
| Strongest original-skill CLI agent | CLI (original skills) | 48.2 |
| CLI with verifier-guided skill augmentation | CLI (verifier-guided skills) | 69.3 |
Written by The Brieftide · Source: arXiv