Computer-Use Agents: RL with Autonomous Evaluation +12.6pp
A vision-language evaluator supplies terminal feedback for GUI agents.
TL;DR
- A vision-language evaluator supplies terminal feedback for GUI agents.
- The paper, submitted 23 June 2026 and accepted to GLOW @ IJCAI 2026, reports an average success-rate increase of 12.6 percentage points over zero-shot baselines.
- The paper uses a Vision-Language Model to judge task completion from a final screenshot and the original instruction, supplying terminal feedback without task-specific heuristics or manual labels.
Marta Sumyk and Oleksandr Kosovan propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for Computer-Use Agents, and they show that modeling evaluator noise boosts task success. The paper, submitted 23 June 2026 and accepted to GLOW @ IJCAI 2026, reports an average success-rate increase of 12.6 percentage points over zero-shot baselines.
How does autonomous evaluation work?
The paper uses a Vision-Language Model to judge task completion from a final screenshot and the original instruction, supplying terminal feedback without task-specific heuristics or manual labels. During policy optimization the evaluator provides a binary completion signal; the authors model that feedback as a "noisy binary reward channel" and derive a noise-corrected reward estimator for Proximal Policy Optimization.
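To make the "noisy binary reward channel" idea concrete, here is one standard unbiased correction for a binary signal with asymmetric flip rates. This is a sketch under stated assumptions, not the paper's derivation: the brief does not give the authors' exact estimator, and the symbols r (true outcome), r̃ (evaluator verdict), ρ₊ (false-negative rate), and ρ₋ (false-positive rate) are introduced here for illustration. It also assumes the flip rates are known and that ρ₊ + ρ₋ < 1, meaning the evaluator is better than chance.

```latex
% Assumed noise model (illustrative notation, not taken from the paper):
%   r \in \{0,1\}          true task outcome
%   \tilde{r} \in \{0,1\}  evaluator verdict
\rho_{+} = P(\tilde{r} = 0 \mid r = 1), \qquad
\rho_{-} = P(\tilde{r} = 1 \mid r = 0), \qquad
\rho_{+} + \rho_{-} < 1

% Candidate corrected reward
\hat{r} = \frac{\tilde{r} - \rho_{-}}{1 - \rho_{+} - \rho_{-}}

% Unbiasedness check, one case per true outcome
\mathbb{E}[\hat{r} \mid r = 1]
  = \frac{(1 - \rho_{+}) - \rho_{-}}{1 - \rho_{+} - \rho_{-}} = 1,
\qquad
\mathbb{E}[\hat{r} \mid r = 0]
  = \frac{\rho_{-} - \rho_{-}}{1 - \rho_{+} - \rho_{-}} = 0
```

Under these assumptions the corrected reward matches the true outcome in expectation, at the cost of higher variance as ρ₊ + ρ₋ approaches 1.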
Sumyk and Kosovan describe the pipeline in three practical pieces: a GUI agent that executes high-level user goals, a Vision-Language Model that scores final screenshots against instructions, and a noise-correction module that adjusts the evaluator signal before feeding it into PPO. The evaluator removes the need for handcrafted reward functions or dense manual labels by judging visually grounded task success.
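The noise-correction piece of that pipeline can be sketched in a few lines of Python. Everything named here is hypothetical: `NoiseModel`, `corrected_reward`, `terminal_rewards`, and the error rates 0.1 and 0.2 are illustrative assumptions, and the correction is a standard unbiased estimator for a binary signal with known flip rates, which may differ from the estimator the authors derive. The evaluator itself is replaced by a simulated noisy verdict, since the brief does not describe the model or its interface.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseModel:
    """Assumed evaluator error rates (hypothetical, not from the paper)."""
    false_positive: float  # P(verdict = 1 | task actually failed)
    false_negative: float  # P(verdict = 0 | task actually succeeded)


def corrected_reward(verdict: int, noise: NoiseModel) -> float:
    """Rescale a binary verdict so its expectation equals the true outcome."""
    denom = 1.0 - noise.false_positive - noise.false_negative
    if denom <= 0.0:
        raise ValueError("error rates must sum to less than 1")
    return (verdict - noise.false_positive) / denom


def terminal_rewards(num_steps: int, verdict: int, noise: NoiseModel) -> list[float]:
    """Per-step rewards for one episode: zeros, then the corrected terminal reward."""
    if num_steps < 1:
        raise ValueError("an episode needs at least one step")
    return [0.0] * (num_steps - 1) + [corrected_reward(verdict, noise)]


def simulate_verdict(true_success: int, noise: NoiseModel, rng: random.Random) -> int:
    """Stand-in for the evaluator: flips the true outcome at the assumed rates."""
    if true_success == 1:
        return 0 if rng.random() < noise.false_negative else 1
    return 1 if rng.random() < noise.false_positive else 0


def mean_corrected_reward(true_success: int, noise: NoiseModel,
                          samples: int = 100_000, seed: int = 0) -> float:
    """Average corrected reward over many simulated verdicts."""
    rng = random.Random(seed)
    total = sum(corrected_reward(simulate_verdict(true_success, noise, rng), noise)
                for _ in range(samples))
    return total / samples


if __name__ == "__main__":
    noise = NoiseModel(false_positive=0.1, false_negative=0.2)
    print("episode rewards:", terminal_rewards(4, verdict=1, noise=noise))
    print("mean corrected reward, true success:", mean_corrected_reward(1, noise))
    print("mean corrected reward, true failure:", mean_corrected_reward(0, noise))
```

The list returned by `terminal_rewards` is the kind of per-step reward sequence a PPO implementation would consume when feedback arrives only at the end of an episode. Individual corrected rewards can fall outside [0, 1]; only their average tracks the true outcome.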
What did the experiments measure and where?
The authors evaluate corrected evaluator rewards in three desktop-style environments (macOSWorld, Windows Agent Arena, and OSWorld) and compare them against zero-shot baselines and raw evaluator fine-tuning. Corrected evaluator rewards improved success rates by an average of 12.6 percentage points over zero-shot performance and by 5.1 points over raw evaluator fine-tuning. If both averages cover the same tasks, those figures imply that raw evaluator fine-tuning alone gained roughly 7.5 points over zero-shot.
The paper emphasizes that autonomous evaluators are imperfect, which motivates the formal modeling of evaluator noise. The experiments show that explicitly correcting for that noise yields higher success than either using the evaluator uncorrected or not using it at all.
Why does this matter?
Autonomous, vision-language evaluation offers a scalable way to supervise GUI agents where task success is visually grounded and hard to specify with handcrafted rewards or dense labels. Modeling the evaluator as a noisy channel converts an imperfect but widely applicable signal into useful training feedback, which can raise success rates across diverse desktop environments without manual reward engineering.
This shifts some of the burden from environment-specific engineering to improved evaluator design and noise modeling. For researchers building agents that must act inside graphical interfaces, the study shows a path to reduce manual labeling while keeping training practical and demonstrably more effective than naive evaluator usage.
What are the approach's limitations and scope?
The paper frames the evaluator as imperfect and focuses on terminal feedback from final screenshots rather than dense, stepwise rewards. The noise correction is derived specifically for Proximal Policy Optimization. The reported gains are average improvements in success rate across the three named environments; the source text for this brief does not include per-task breakdowns.
What to watch
Look for follow-up work that applies the noise-corrected estimator to other RL algorithms or that reports per-task success breakdowns for macOSWorld, Windows Agent Arena, and OSWorld. Also watch for releases of the Vision-Language evaluators or code referenced in the paper's supplemental materials, which could enable replication and wider use.
References and provenance
The paper, "Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation," was submitted to arXiv on 23 June 2026 by Marta Sumyk and Oleksandr Kosovan and accepted to the 4th International Workshop on Generalizing from Limited Resources in the Open World (GLOW @ IJCAI 2026). The main quantitative claims in this brief are taken from the paper: a 12.6 percentage point average improvement over zero-shot baselines and a 5.1 point gain over raw evaluator fine-tuning.
Written by The Brieftide · Source: arXiv