Coding Agents · 4 min read

GUI vs CLI benchmark: 440 tasks, GUI 59.1% vs CLI 48.2%

A matched 440-task execution benchmark by Xiao Zhou et al. finds the strongest GUI agent at a 59.1% full pass rate and the strongest original-skill CLI agent at 48.2%.

The Brieftide

TL;DR

  • A matched 440-task execution benchmark by Xiao Zhou et al. finds the strongest GUI agent at a 59.1% full pass rate and the strongest original-skill CLI agent at 48.2%.
  • Xiao Zhou and co‑authors submitted a matched execution-layer benchmark on 22 Jun 2026 that compares screen-only GUI agents with skill-mediated CLI agents across 440 desktop tasks.
  • The paper evaluates both modalities with identical goals, initial states, and final-state verifiers to isolate modality-specific execution bottlenecks.

Xiao Zhou and co‑authors submitted a matched execution-layer benchmark on 22 Jun 2026 that compares screen-only GUI agents with skill-mediated CLI agents across 440 desktop tasks. The paper evaluates both modalities with identical goals, initial states, and final-state verifiers to isolate modality-specific execution bottlenecks.

What did the benchmark test?

The benchmark runs 440 desktop tasks across 18 applications and 12 workflow categories, with identical goals, initial states, and final-state verifiers for both modalities. The dataset is designed so that GUI agents are restricted to screen-native interactions and CLI agents to programmatic skill interfaces, removing confounds from differing tasks, initial conditions, verifiers, or allowed actions.

The matched design forces the two agent types to operate from the same starting points and targets. That lets the authors separate failures that arise from interaction modality from failures that arise because one side has different tools or permissions.
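The matched setup described above can be pictured as a single task record shared by both agent types. The following is a minimal sketch under stated assumptions, not the paper's actual schema: `MatchedTask`, `run_matched`, and the two toy agents are hypothetical names invented for illustration, and the "desktop state" is reduced to a dictionary of file names.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for desktop state: file name -> file contents.
State = Dict[str, str]
Agent = Callable[[str, State], State]


@dataclass(frozen=True)
class MatchedTask:
    """One task shared by both modalities (illustrative, not the paper's schema)."""
    goal: str
    initial_state: State
    verifier: Callable[[State], bool]  # inspects only the final state


def run_matched(task: MatchedTask, agents: Dict[str, Agent]) -> Dict[str, bool]:
    """Run each agent from a fresh copy of the same start state, then apply the same verifier."""
    results = {}
    for name, agent in agents.items():
        final_state = agent(task.goal, dict(task.initial_state))
        results[name] = task.verifier(final_state)
    return results


def toy_gui_agent(goal: str, state: State) -> State:
    # Stands in for screen-native clicks and typing that rename the file.
    state["report.txt"] = state.pop("draft.txt")
    return state


def toy_cli_agent(goal: str, state: State) -> State:
    # Stands in for a skill interface that has no rename skill, so nothing changes.
    return state


task = MatchedTask(
    goal="Rename draft.txt to report.txt",
    initial_state={"draft.txt": "hello"},
    verifier=lambda s: "report.txt" in s and "draft.txt" not in s,
)
outcome = run_matched(task, {"gui": toy_gui_agent, "cli": toy_cli_agent})
print(outcome)
```

Because the goal, start state, and verifier are identical for both entries, any difference in `outcome` can only come from how each agent acted, which is the property the matched design is meant to guarantee.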

How did GUI and CLI perform?

The strongest GUI agent reached a 59.1% full pass rate, the strongest original-skill CLI agent reached a 48.2% full pass rate, and a verifier-guided skill augmentation raised CLI success to 69.3%. These three numbers are the paper's central quantitative findings.

The authors highlight that augmenting CLI agents with verifier-guided skills flips the ranking: verifier-guided CLI agents exceed the strongest GUI agent's full pass rate. The paper frames the raw 48.2% figure for original-skill CLI agents as not solely a model limitation, writing that "much of the CLI deficit comes from incomplete skill coverage rather than model capability alone."
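The size of the ranking flip is easier to see as percentage-point gaps. The sketch below uses only the three reported rates and the 440-task total; the implied task counts are an assumption, namely that each rate is a simple fraction of all 440 tasks, and they are not figures quoted from the paper.

```python
TOTAL_TASKS = 440

# Full pass rates (%) as reported in the brief.
full_pass_rate = {
    "gui": 59.1,
    "cli_original": 48.2,
    "cli_verifier_guided": 69.3,
}


def gap_pp(a: str, b: str) -> float:
    """Percentage-point difference between two configurations, rounded to one decimal."""
    return round(full_pass_rate[a] - full_pass_rate[b], 1)


gui_lead_over_original_cli = gap_pp("gui", "cli_original")
augmentation_gain = gap_pp("cli_verifier_guided", "cli_original")
augmented_cli_lead_over_gui = gap_pp("cli_verifier_guided", "gui")

# Assumption: each rate is passes / 440, so the nearest whole task count is implied.
implied_tasks = {
    name: round(TOTAL_TASKS * rate / 100) for name, rate in full_pass_rate.items()
}

print(gui_lead_over_original_cli, augmentation_gain, augmented_cli_lead_over_gui)
print(implied_tasks)
```

Under that assumption, the GUI agent leads original-skill CLI by 10.9 points, augmentation adds 21.1 points to CLI, and augmented CLI then leads GUI by 10.2 points, corresponding to roughly 260, 212, and 305 fully passed tasks.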

Beyond the headline rates, the benchmark spans diverse applications and workflows so that long-horizon tasks and multi-step GUI interactions are part of the evaluation, stressing different failure modes for each modality.

Why does each modality fail differently?

GUI agents are limited by how reliably they can sustain grounded interaction over long-horizon workflows, while CLI agents are limited by the coverage and scalability of their skill interfaces. The paper positions these as distinct execution bottlenecks: GUI work depends on sustained, accurate manipulation of on-screen elements across many steps; CLI work depends on whether the available skill set can express the needed operations and whether that skill set scales to broad desktop functionality.

The verifier-guided improvement for CLI agents demonstrates that expanding or better coordinating skills can overcome a large portion of the CLI gap. Conversely, the GUI agents’ remaining failures point to the difficulty of robust, grounded control in extended GUI sequences despite having modality-native actions.

What to watch

Look for follow-up work that expands skill coverage for CLI agents or that strengthens long-horizon grounding for GUI agents. The paper presents verifier-guided skill augmentation as a concrete lever: further gains from broader skill coverage or stronger verification mechanisms would support the view that the CLI bottleneck is primarily an engineering scope problem rather than a model capability ceiling.

Improvements in GUI interaction primitives and in methods for maintaining reliable state across multi-step workflows would show whether the GUI bottleneck can be closed without changing modality.

The paper, arXiv:2606.24551, supplies the matched 440-task benchmark, its scope of 18 applications and 12 workflow categories, and the three headline full pass rates (59.1%, 48.2%, 69.3%) against which that follow-up work can be measured.

Full pass rates by agent configuration

Item                                          Configuration                  Full pass rate (%)
Strongest GUI agent                           GUI                            59.1
Strongest original-skill CLI agent            CLI (original skills)          48.2
CLI with verifier-guided skill augmentation   CLI (verifier-guided skills)   69.3

Written by The Brieftide · Source: arXiv
