Gemini Omni demos: 9 Google I/O videos show multimodal power
Nine Google I/O videos show Gemini Omni using images and audio, and Gemini 3.5 solving coding tasks and assistant queries.
TL;DR
- Nine Google I/O videos show Gemini Omni using images and audio, and Gemini 3.5 solving coding tasks and assistant queries.
- Google published nine demo videos after Google I/O 2026 that illustrate how Gemini Omni and Gemini 3.5 handle multimodal tasks and conversational workloads.
- The clips, posted to the company blog and official channels, present discrete use cases ranging from image and audio understanding to code generation and agent-style task execution.
Google published nine demo videos after Google I/O 2026 that illustrate how Gemini Omni and Gemini 3.5 handle multimodal tasks and conversational workloads. The clips, posted to the company blog and official channels, present discrete use cases ranging from image and audio understanding to code generation and agent-style task execution.
What the videos show
The videos fall into two rough groups. Gemini Omni clips focus on multimodal interaction, combining images, video, and audio with conversational prompts. Examples include identifying objects in cluttered scenes, following a narrated walkthrough of a physical space, and answering questions about video content. One clip shows real-time annotation of a scene, with the model responding to both a live camera feed and spoken instructions.
Gemini 3.5 demos emphasize text and code capabilities. Several videos show stepwise problem solving in a chat interface, including debugging and generating code snippets, producing structured outputs, and integrating external tools for retrieval and computation. One example demonstrates a multi-step planning task where the model drafts a sequence of actions, queries a simulated calendar, and refines the plan after receiving new constraints.
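The planning demo's draft, query, and refine loop can be illustrated with a minimal sketch. This is not Google's implementation and calls no Gemini API: the `Slot` type, the `BUSY` set standing in for the simulated calendar, and the greedy `draft_plan` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    day: str
    hour: int  # 24-hour clock


# Hypothetical stand-in for the simulated calendar: slots that are already booked.
BUSY = {Slot("Mon", 10), Slot("Tue", 9)}


def query_calendar(slot: Slot) -> bool:
    """Tool call: return True if the slot is free in the simulated calendar."""
    return slot not in BUSY


def draft_plan(tasks, candidates, constraints):
    """Greedily give each task the first free slot that passes every constraint."""
    plan = {}
    taken = set()
    for task in tasks:
        for slot in candidates:
            if slot in taken or not query_calendar(slot):
                continue
            if all(ok(task, slot) for ok in constraints):
                plan[task] = slot
                taken.add(slot)
                break
    return plan


candidates = [Slot(d, h) for d in ("Mon", "Tue") for h in (9, 10, 14)]
tasks = ["review", "deploy"]

# Step 1: draft a plan with no extra constraints.
plan = draft_plan(tasks, candidates, constraints=[])


# Step 2: a new constraint arrives (assumed here: no deploys on Monday).
def no_monday_deploy(task, slot):
    return not (task == "deploy" and slot.day == "Mon")


# Step 3: refine by re-planning under the new constraint.
refined = draft_plan(tasks, candidates, constraints=[no_monday_deploy])

print("draft:  ", plan)
print("refined:", refined)
```

In this toy setup the draft places the deploy on Monday afternoon, and the refined plan moves it to Tuesday once the constraint is added. A real agent would let the model propose the steps and decide when to call the tool; the greedy search here only stands in for that decision-making.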
The interfaces vary across the videos. Some demos run inside a chat console, while others overlay model outputs on video playback or AR-style visuals. Inputs are often multimodal: screenshots or phone-camera footage plus spoken queries, or a developer prompt paired with a codebase snapshot. Each clip highlights a single capability, keeping interactions short and focused.
Technical highlights and limits
The demos illustrate two technical themes: broader input modality support and tighter tool integration. Gemini Omni is shown accepting images, short video clips, and audio cues within a single session, and combining those inputs to produce grounded answers. Gemini 3.5 is presented as an updated chat-and-code workhorse, with extended context handling for multi-step tasks and explicit tool calls for retrieval or execution.
The videos do not disclose behind-the-scenes deployment details, latency figures, or the compute used for each demo. Interactions appear curated and often rely on clean inputs: well-lit video, clear audio, and orderly code samples. That curation makes capabilities easier to demonstrate but limits evidence about robustness in noisy, real-world conditions. The clips also show built-in guardrails in some cases, for example model refusals on sensitive prompts, but do not provide a systematic assessment of failure modes.
Here is a concise comparison of the nine demos and what each illustrates:
| Demo | Model | Modalities | Demo focus | Notable feature |
|---|---|---|---|---|
| Demo 1 | Gemini Omni | Image + text | Object identification in cluttered scene | Multimodal grounding |
| Demo 2 | Gemini Omni | Video + audio | Follow narrated walkthrough and answer questions | Temporal video understanding |
| Demo 3 | Gemini Omni | Image + audio | Annotate live camera feed with spoken instructions | Real-time overlays |
| Demo 4 | Gemini Omni | Image + text | Visual layout comprehension, extract details from photo | Scene layout parsing |
| Demo 5 | Gemini 3.5 | Text | Multi-turn chat solving a planning task | Stepwise reasoning |
| Demo 6 | Gemini 3.5 | Code + text | Generate and debug code snippets | Iterative code refinement |
| Demo 7 | Gemini 3.5 | Text + tools | Retrieval-augmented QA with simulated tool calls | Tool integration |
| Demo 8 | Gemini 3.5 | Text | Structured output generation for reports | Template-aware generation |
| Demo 9 | Gemini Omni + 3.5 | Multimodal + tools | Hybrid workflow combining visual input and code-oriented tools | Cross-model orchestration |
Why it matters
The videos make clear that Google is positioning multimodal models as practical interfaces for mixed visual, audio, and text tasks, while keeping a separate line for text-and-code-intensive assistant work. Developers and enterprises preparing to integrate multimodal inputs will need to evaluate robustness and privacy in real deployments, since the demos show capabilities but not real-world failure rates or infrastructure costs.
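One way a team might start on that robustness evaluation is to score the same model on clean inputs and on deliberately degraded copies, then compare. The sketch below is an assumption-laden illustration, not a method from the videos: `keyword_model` is a toy stand-in for a real model call, and dropping characters from text is only a simple proxy for noisy audio or poorly lit video.

```python
import random
from typing import Callable, Sequence, Tuple


def add_noise(text: str, rate: float, rng: random.Random) -> str:
    """Drop each character with probability `rate` to mimic a degraded input."""
    return "".join(ch for ch in text if rng.random() >= rate)


def accuracy(model: Callable[[str], str],
             inputs: Sequence[str],
             labels: Sequence[str]) -> float:
    """Fraction of inputs for which the model output matches the label."""
    correct = sum(model(x) == y for x, y in zip(inputs, labels))
    return correct / len(labels)


def robustness_gap(model: Callable[[str], str],
                   inputs: Sequence[str],
                   labels: Sequence[str],
                   rate: float,
                   seed: int = 0) -> Tuple[float, float]:
    """Return (clean accuracy, noisy accuracy) for one noise rate."""
    rng = random.Random(seed)  # fixed seed keeps the comparison repeatable
    noisy = [add_noise(x, rate, rng) for x in inputs]
    return accuracy(model, inputs, labels), accuracy(model, noisy, labels)


# Toy stand-in for a real model: labels a caption by keyword match.
def keyword_model(text: str) -> str:
    return "cat" if "cat" in text else "other"


inputs = ["a cat on a mat", "a dog in the fog", "the cat sat", "blue sky"]
labels = ["cat", "other", "cat", "other"]

for rate in (0.0, 0.2, 1.0):
    clean_acc, noisy_acc = robustness_gap(keyword_model, inputs, labels, rate)
    print(f"noise={rate:.1f} clean={clean_acc:.2f} noisy={noisy_acc:.2f}")
```

The gap between the two numbers at each noise rate gives a rough measure of how quickly quality degrades. In a real deployment the perturbations would be modality-specific, such as background noise in audio or blur in images, and drawn from actual field data where possible.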
Written by The Brieftide · Source: Google AI