NAVI-Orbital: first in-orbit demo of Gemma 3 for Earth observation
NAVI-Orbital ran a local Gemma 3 VLM aboard a LEO spacecraft on April 16, 2026.
TL;DR
- NAVI-Orbital ran a local Gemma 3 VLM aboard a LEO spacecraft on April 16, 2026.
- NAVI-Orbital achieved the first in-orbit demonstration of a zero-shot vision-language model performing autonomous multi-modal inference entirely onboard a Low Earth Orbit spacecraft on April 16, 2026.
- The system used a local Gemma 3 model to classify scenes, generate textual descriptions, and answer operator follow-up via natural-language dialogue.
What did NAVI-Orbital demonstrate?
NAVI-Orbital demonstrated running foundation-model inference onboard a LEO spacecraft without any in-orbit fine-tuning, producing scene classifications, textual descriptions, and dialogic responses to plain-English prompts. The authors state the flight run included live captures of previously unseen Earth imagery, including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument.
The paper cites ground benchmarking results of 88.16% accuracy on the 7,960-image curated AID benchmark, plus Flatsat validation before the live in-orbit captures. The arXiv submission lists the demonstration date and situates the work as, to the authors' knowledge, the first such in-orbit result.
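As a rough illustration of how a zero-shot benchmark figure of this kind can be scored, the sketch below maps a model's free-text reply to one of the candidate class names and computes accuracy over labelled pairs. The function names, the class list, and the toy replies are all hypothetical; the paper's actual prompting and scoring procedure is not described here.

```python
from typing import List, Optional, Sequence, Tuple


def parse_label(reply: str, classes: Sequence[str]) -> Optional[str]:
    """Return the candidate class mentioned earliest in the reply, or None."""
    text = reply.lower()
    hits = [(text.find(c.lower()), c) for c in classes if c.lower() in text]
    return min(hits)[1] if hits else None


def accuracy(pairs: List[Tuple[str, str]], classes: Sequence[str]) -> float:
    """Fraction of (truth, reply) pairs whose parsed label equals the truth."""
    if not pairs:
        raise ValueError("need at least one labelled example")
    correct = sum(parse_label(reply, classes) == truth for truth, reply in pairs)
    return correct / len(pairs)


# Toy data for illustration only (not taken from the AID benchmark).
CLASSES = ["airport", "beach", "forest"]
PAIRS = [
    ("airport", "This looks like an airport with two runways."),
    ("beach", "Forest next to a river."),
    ("forest", "Dense forest canopy."),
    ("beach", "I cannot tell."),
]
score = accuracy(PAIRS, CLASSES)
```

A reply that names no candidate class counts as a miss, which is one simple way to handle free-form model output in a closed-set benchmark.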
How does the system work?
NAVI-Orbital uses LangGraph, a framework for building graph-based state machines, to coordinate dedicated detection and dialogue agents around a local vision-language model (Gemma 3) that performs zero-shot inference. The spacecraft instrument captures imagery and forwards it to an onboard GPU, where Gemma 3 runs hardware-accelerated inference; the LangGraph-managed agents then turn the model's output into labels, descriptions, and natural-language replies.
Operators re-task the system with plain-English prompts in place of conventional command sequences. The authors describe three validation stages for the demonstration: ground benchmarking, Flatsat validation, and live in-orbit captures.
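The detection-then-dialogue flow can be pictured as a small graph-based state machine. The sketch below is plain Python and deliberately does not use the LangGraph API; the node names, state fields, routing rule, and the `fake_vlm` stub are assumptions made for illustration and are not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    """Shared state handed from node to node (all fields are illustrative)."""
    image_id: str
    prompt: str
    label: Optional[str] = None
    description: Optional[str] = None
    reply: Optional[str] = None
    trace: List[str] = field(default_factory=list)


def fake_vlm(image_id: str, prompt: str) -> str:
    """Stand-in for the onboard vision-language model; returns canned text."""
    if "classify" in prompt.lower():
        return "harbor"
    return f"Answer about {image_id}: {prompt}"


def detection_agent(state: State) -> str:
    """Label and describe the scene, then pick the next node."""
    state.trace.append("detect")
    state.label = fake_vlm(state.image_id, "Classify this scene.")
    state.description = f"Scene in {state.image_id} looks like a {state.label}."
    # Go to dialogue only when the operator supplied a follow-up prompt.
    return "dialogue" if state.prompt else "end"


def dialogue_agent(state: State) -> str:
    """Answer the operator's plain-English follow-up."""
    state.trace.append("dialogue")
    state.reply = fake_vlm(state.image_id, state.prompt)
    return "end"


NODES: Dict[str, Callable[[State], str]] = {
    "detect": detection_agent,
    "dialogue": dialogue_agent,
}


def run_graph(state: State, entry: str = "detect", max_steps: int = 10) -> State:
    """Walk the graph from the entry node until a node routes to 'end'."""
    node = entry
    for _ in range(max_steps):
        if node == "end":
            return state
        node = NODES[node](state)
    raise RuntimeError("graph did not terminate")


result = run_graph(State(image_id="img-001", prompt="Are there ships?"))
```

Each node returns the name of the next node, so re-tasking amounts to changing the prompt in the state and not the control flow itself.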
Why it matters
Running a vision-language foundation model onboard shifts the data flow by creating semantic outputs in place of downlinking all raw imagery. The authors frame this as a way to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations. That matters for missions where downlink bandwidth and human-in-the-loop processing are bottlenecks, because semantic summaries and queryable dialogue reduce the volume of data that must be sent to ground stations.
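To make the bandwidth argument concrete, the sketch below compares the size of one raw frame with the size of a short text summary. Every number and string in it is a hypothetical placeholder chosen for illustration; the paper's actual frame sizes and output formats are not given in this brief.

```python
def downlink_ratio(raw_image_bytes: int, summary_text: str) -> float:
    """Return how many times smaller the UTF-8 summary is than the raw image."""
    summary_bytes = len(summary_text.encode("utf-8"))
    if summary_bytes == 0:
        raise ValueError("summary must not be empty")
    return raw_image_bytes / summary_bytes


# Hypothetical values, not from the paper.
raw_bytes = 4096 * 4096 * 2  # assumed 4096x4096 frame at 16 bits per pixel
summary = "label=harbor; several vessels near the breakwater; partial cloud cover"

ratio = downlink_ratio(raw_bytes, summary)
print(f"raw frame: {raw_bytes} bytes, summary: {len(summary.encode('utf-8'))} bytes")
print(f"summary is roughly {ratio:.0f}x smaller")
```

The ratio depends entirely on the assumed frame size and summary length, which is why measured bandwidth savings from the mission itself would be the more meaningful figure.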
The 88.16% accuracy on the 7,960-image curated AID benchmark gives one concrete performance point for the zero-shot approach. The system was also validated on a Flatsat and with live in-orbit captures, including uncorrected YAM-9 imagery, which shows the software pipeline ran end-to-end on satellite-class edge hardware.
What to watch
Watch for follow-up work that reports on operational bandwidth savings, the size and latency of onboard semantic outputs versus raw downlinks, and any public releases of code or models tied to the demonstration. Also look for independent replication of the claim that Gemma 3 can be run zero-shot onboard a flight instrument without fine-tuning under real mission conditions.
Authors and provenance: the paper is by Juan Manuel Delfa Victoria, Taran Cyriac John, and Andrew W. Herson, submitted to arXiv on 5 Jun 2026 as arXiv:2606.18271. The preprint is 17 pages with 47 figures and describes ground benchmark figures, Flatsat tests, and the in-orbit demonstration on April 16, 2026.
Written by The Brieftide · Source: arXiv