Multimodal AI · 4 min read

Gemini Omni demos: 9 Google I/O videos show multimodal power

Nine Google I/O videos show Gemini Omni using images and audio, and Gemini 3.5 solving coding tasks and assistant queries.

The Brieftide

Google published nine demo videos after Google I/O 2026 that illustrate how Gemini Omni and Gemini 3.5 handle multimodal tasks and conversational workloads. The clips, posted to the company blog and official channels, present discrete use cases ranging from image and audio understanding to code generation and agent-style task execution.

What the videos show

The videos fall into two rough groups. Gemini Omni clips focus on multimodal interaction, combining images, video, and audio with conversational prompts. Examples include identifying objects in cluttered scenes, following a narrated walkthrough of a physical space, and answering questions about video content. One clip shows real-time annotation of a scene, with the model responding to both a live camera feed and spoken instructions.

Gemini 3.5 demos emphasize text and code capabilities. Several videos show stepwise problem solving in a chat interface, including debugging and generating code snippets, producing structured outputs, and integrating external tools for retrieval and computation. One example demonstrates a multi-step planning task where the model drafts a sequence of actions, queries a simulated calendar, and refines the plan after receiving new constraints.
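
The draft, query, refine loop in that planning example can be sketched in a few lines of Python. This is an illustrative sketch only: the video does not show the underlying implementation, so `SimulatedCalendar`, `draft_plan`, and `refine_plan` are hypothetical names, and a simple rule-based scheduler stands in for the model.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedCalendar:
    """Hypothetical stand-in for the simulated calendar tool in the demo."""
    busy_hours: set[int] = field(default_factory=set)

    def is_free(self, hour: int) -> bool:
        return hour not in self.busy_hours


def draft_plan(tasks: list[str], calendar: SimulatedCalendar,
               start_hour: int = 9, end_hour: int = 17) -> dict[str, int]:
    """Give each task the earliest free hour, querying the calendar per slot."""
    plan: dict[str, int] = {}
    hour = start_hour
    for task in tasks:
        # Skip hours the calendar reports as busy.
        while hour < end_hour and not calendar.is_free(hour):
            hour += 1
        if hour >= end_hour:
            raise ValueError(f"no free slot left for {task!r}")
        plan[task] = hour
        hour += 1
    return plan


def refine_plan(tasks: list[str], calendar: SimulatedCalendar,
                new_busy_hours: set[int]) -> dict[str, int]:
    """Re-plan from scratch after a new constraint (extra busy hours) arrives."""
    calendar.busy_hours |= set(new_busy_hours)
    return draft_plan(tasks, calendar)


calendar = SimulatedCalendar(busy_hours={10})
tasks = ["write report", "review code", "team sync"]

first = draft_plan(tasks, calendar)
revised = refine_plan(tasks, calendar, new_busy_hours={9})

print("first plan:  ", first)
print("revised plan:", revised)
```

The point of the sketch is the control flow, not the scheduling rule: the plan is drafted against tool output, and a new constraint triggers another pass rather than a patch to the old plan.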

The interfaces vary across the videos. Some demos run inside a chat console; others overlay model outputs on video playback or AR-style visuals. Inputs are often multimodal: screenshots or phone-camera footage plus spoken queries, or a developer prompt paired with a codebase snapshot. Each clip highlights a single capability, keeping interactions short and focused.

Technical highlights and limits

The demos illustrate two technical themes: broader input modality support and tighter tool integration. Gemini Omni is shown accepting images, short video clips, and audio cues within a single session, and combining those inputs to produce grounded answers. Gemini 3.5 is presented as an updated chat-and-code workhorse, with extended context handling for multi-step tasks and explicit tool calls for retrieval or execution.

The videos do not disclose behind-the-scenes deployment details, latency figures, or the compute used for each demo. Interactions appear curated and often rely on clean inputs: well-lit video, clear audio, and orderly code samples. That curation makes capabilities easier to demonstrate but limits evidence about robustness in noisy, real-world conditions. The clips also show built-in guardrails in some cases, for example model refusals on sensitive prompts, but do not provide a systematic assessment of failure modes.
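
One way to probe the robustness gap described above is to run the same labelled inputs through a model at increasing noise levels and compare accuracy. The sketch below is hypothetical throughout: `toy_model` is a trivial stand-in classifier, not a Gemini call, and the brightness values, the 0.5 threshold, and the noise levels are invented for illustration.

```python
import random


def toy_model(brightness: float) -> str:
    """Hypothetical stand-in for a model call: labels a scene by brightness."""
    return "day" if brightness >= 0.5 else "night"


def add_noise(value: float, level: float, rng: random.Random) -> float:
    """Perturb an input by uniform noise in [-level, +level], clamped to [0, 1]."""
    return min(1.0, max(0.0, value + rng.uniform(-level, level)))


def accuracy_under_noise(samples, level: float,
                         trials: int = 200, seed: int = 0) -> float:
    """Fraction of correct labels over repeated noisy copies of each sample."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    correct = 0
    total = 0
    for _ in range(trials):
        for value, label in samples:
            if toy_model(add_noise(value, level, rng)) == label:
                correct += 1
            total += 1
    return correct / total


# Clean, well-separated inputs, analogous to the curated demo conditions.
samples = [(0.9, "day"), (0.7, "day"), (0.3, "night"), (0.1, "night")]

for level in (0.0, 0.1, 0.3, 0.5):
    acc = accuracy_under_noise(samples, level)
    print(f"noise={level:.1f} accuracy={acc:.2f}")
```

With these toy values, accuracy stays perfect at noise levels 0.0 and 0.1 because no sample can cross the threshold, and it falls once the noise is large enough to push inputs across it. A real evaluation would swap in actual model calls and realistic perturbations such as low light, background noise, or messy code.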

Here is a concise comparison of the nine demos and what each illustrates:

| Demo | Model | Modalities | Demo focus | Notable feature |
| --- | --- | --- | --- | --- |
| Demo 1 | Gemini Omni | Image + text | Object identification in cluttered scene | Multimodal grounding |
| Demo 2 | Gemini Omni | Video + audio | Follow narrated walkthrough and answer Qs | Temporal video understanding |
| Demo 3 | Gemini Omni | Image + audio | Annotate live camera feed with spoken instructions | Real-time overlays |
| Demo 4 | Gemini Omni | Image + text | Visual layout comprehension, extract details from photo | Scene layout parsing |
| Demo 5 | Gemini 3.5 | Text | Multi-turn chat solving a planning task | Stepwise reasoning |
| Demo 6 | Gemini 3.5 | Code + text | Generate and debug code snippets | Iterative code refinement |
| Demo 7 | Gemini 3.5 | Text + tools | Retrieval-augmented QA with simulated tool calls | Tool integration |
| Demo 8 | Gemini 3.5 | Text | Structured output generation for reports | Template-aware generation |
| Demo 9 | Gemini Omni + 3.5 | Multimodal + tools | Hybrid workflow combining visual input and code-oriented tools | Cross-model orchestration |

Why it matters

The videos make clear that Google is positioning multimodal models as practical interfaces for mixed visual, audio, and text tasks, while keeping a separate line for text-and-code-intensive assistant work. Developers and enterprises preparing to integrate multimodal inputs will need to evaluate robustness and privacy in real deployments, since the demos show capabilities but not real-world failure rates or infrastructure costs.

Written by The Brieftide · Source: Google AI
