
Google Gemini 3.5 Flash: computer control and OSWorld 78.4

Gemini 3.5 Flash can view and operate your screen and scores 78.4 on the OSWorld benchmark.

The Brieftide

TL;DR

  • Gemini 3.5 Flash can view and operate your screen and scores 78.4 on the OSWorld benchmark.
  • Google has integrated the "Computer Use" capability directly into Gemini 3.5 Flash, enabling the model to see, understand, and interact with computers, browsers, and mobile devices on its own.
  • The change was announced on Jun 25, 2026, and the feature is available through the Gemini API and the Gemini Enterprise Agent Platform.

Google has integrated the "Computer Use" capability directly into Gemini 3.5 Flash, enabling the model to see, understand, and interact with computers, browsers, and mobile devices on its own. The change was announced on Jun 25, 2026, and the feature is available through the Gemini API and the Gemini Enterprise Agent Platform.

What changed in Gemini 3.5 Flash?

Gemini 3.5 Flash now embeds the feature Google labels "Computer Use", so the model can autonomously view and operate screens across browser, mobile, and desktop environments. Previously, that functionality existed only as a separate Gemini 2.5 model; integrating it into 3.5 Flash lets developers combine it with function calls, Search, and Maps to build cross-environment agents for tasks like software testing or office automation.
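As a rough illustration of what "view and operate" means in practice, a screen-control agent is usually structured as an observe, decide, act loop. The sketch below is not the Gemini API: `observe`, `decide`, and `act` are hypothetical callables standing in for a screenshot source, the model, and an input driver, and the real request and response shapes are not described in the source.

```python
from typing import Callable, List, Optional


def run_agent(
    observe: Callable[[], str],
    decide: Callable[[str, List[str]], Optional[str]],
    act: Callable[[str], None],
    max_steps: int = 10,
) -> List[str]:
    """Generic observe/decide/act loop (all three callables are hypothetical).

    observe: returns the current screen state (a string stands in for a screenshot).
    decide:  stands in for the model; returns the next action, or None when done.
    act:     performs the action in the environment.
    """
    history: List[str] = []
    for _ in range(max_steps):  # hard step limit so a confused agent cannot loop forever
        screen = observe()
        action = decide(screen, history)
        if action is None:  # the model signals that the task is complete
            break
        act(action)
        history.append(action)
    return history
```

The step limit is a design choice rather than anything the source specifies; it keeps a misbehaving agent from running unbounded.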

Gemini 3.5 Flash also ships with developer-facing assets: a Browserbase demo and a GitHub reference implementation to illustrate integrations. Google points users toward a best practices documentation page and the Gemini Enterprise Agent Platform for enterprise deployments.

How does Gemini 3.5 Flash compare on benchmarks?

On the OSWorld benchmark, Gemini 3.5 Flash scores 78.4, ahead of Gemini 3 Flash (65.1), GPT-5.4 mini (72.1), and Gemini 3.1 Pro (76.2). It ties Sonnet 4.6 at 78.4 and trails GPT-5.5 (78.7) and Anthropic's Opus 4.8, which leads at 83.4. These published OSWorld numbers put Gemini 3.5 Flash in a tie for third among the seven models listed, 5.0 points behind the top scorer.

The primary source lists exact OSWorld scores for each model, so developers and evaluators can compare raw outcomes:

  • Opus 4.8: 83.4
  • GPT-5.5: 78.7
  • Gemini 3.5 Flash: 78.4
  • Sonnet 4.6: 78.4
  • Gemini 3.1 Pro: 76.2
  • GPT-5.4 mini: 72.1
  • Gemini 3 Flash: 65.1
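For readers who want to check the ordering and the gaps themselves, here is a small Python sketch that ranks the quoted scores and computes point differences. The scores are the ones listed in this brief; the helper names `ranked` and `gap` are made up for illustration.

```python
# OSWorld scores exactly as quoted in this brief.
SCORES = {
    "Gemini 3.5 Flash": 78.4,
    "Gemini 3 Flash": 65.1,
    "GPT-5.4 mini": 72.1,
    "GPT-5.5": 78.7,
    "Opus 4.8": 83.4,
    "Sonnet 4.6": 78.4,
    "Gemini 3.1 Pro": 76.2,
}


def ranked(scores):
    """Return (rank, model, score) tuples, highest score first; tied scores share a rank."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for position, (model, score) in enumerate(ordered, start=1):
        # Reuse the previous rank when the score is identical (a tie).
        rank = rows[-1][0] if rows and rows[-1][2] == score else position
        rows.append((rank, model, score))
    return rows


def gap(scores, model, other):
    """Point difference between two models, rounded to one decimal place."""
    return round(scores[model] - scores[other], 1)


if __name__ == "__main__":
    for rank, model, score in ranked(SCORES):
        print(f"{rank}. {model}: {score}")
    print("Gain over Gemini 3 Flash:", gap(SCORES, "Gemini 3.5 Flash", "Gemini 3 Flash"))
```

On these numbers, Gemini 3.5 Flash sits 13.3 points above Gemini 3 Flash and 0.3 points below GPT-5.5.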

What safety and enterprise controls are included?

Google says it uses adversarial training to reduce susceptibility to prompt injection attacks and offers two optional enterprise safeguards. One safeguard requires user confirmation before sensitive or irreversible actions. The other automatically stops tasks when it detects indirect prompt injections. Google additionally recommends sandboxing, human oversight, and strict access controls and points to detailed guidance in its best practices documentation.

These controls are described as optional enterprise additions, and the feature set is routed through the Gemini API and the Gemini Enterprise Agent Platform for integration into existing systems.
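The two safeguards can be pictured as checks wrapped around each action an agent proposes. The following is a minimal sketch of that idea, not Google's implementation: the `Action` type, the `SENSITIVE_KINDS` set, and the `confirm` and `injection_detected` callbacks are all hypothetical, and the source does not say how the platform classifies actions or detects injections.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical examples of action kinds treated as sensitive or irreversible.
SENSITIVE_KINDS = {"delete_file", "submit_payment", "send_email"}


@dataclass
class Action:
    kind: str    # e.g. "click", "type", "delete_file"
    target: str  # what the action applies to


class TaskStopped(Exception):
    """Raised when a safeguard halts the whole task."""


def run_with_safeguards(
    actions: Iterable[Action],
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    injection_detected: Callable[[Action], bool],
) -> List[Action]:
    """Run actions in order, applying both safeguards before each one."""
    executed: List[Action] = []
    for action in actions:
        # Safeguard 2: stop the task outright on a suspected indirect prompt injection.
        if injection_detected(action):
            raise TaskStopped(f"suspected prompt injection before '{action.kind}'")
        # Safeguard 1: sensitive actions run only after explicit user confirmation.
        if action.kind in SENSITIVE_KINDS and not confirm(action):
            continue  # user declined, so skip this action and carry on
        execute(action)
        executed.append(action)
    return executed
```

Skipping a declined action and continuing is one possible policy; a stricter deployment might abort the whole task instead.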

Why it matters

Embedding screen-level control inside a mainstream model lowers the engineering friction of building agents that interact directly with user interfaces. Developers no longer need to stitch a separate model for computer control to higher-level logic; they can pair Gemini 3.5 Flash with function calls, Search, and Maps to automate multi-step, cross-platform tasks such as software testing or office automation. The OSWorld score of 78.4 places Gemini 3.5 Flash competitively among high-performing models, which matters to teams weighing model choice for interactive automation.

The availability of demos and a reference implementation also accelerates experimentation, while the enterprise safeguards and recommendations acknowledge the new attack surface created by giving an LLM live control over interfaces.

What to watch

Watch adoption signals in developer tooling and demos built on the Gemini API and the Gemini Enterprise Agent Platform, and track whether enterprises enable the optional safeguards, especially the user-confirmation and automatic-stop options, for production agents. Also watch OSWorld scores from competing models to see whether the narrow gaps in the high 70s shift after real-world deployments.


Written by The Brieftide · Source: The Decoder
