EO-Agents: Three-Agent LLM Pipeline for NASA EO Hypotheses
A three-agent LLM pipeline grounded in the NASA Earth Observation Knowledge Graph generated 160 hypotheses from 1,475 datasets.
TL;DR
- A three-agent LLM pipeline grounded in the NASA Earth Observation Knowledge Graph generated 160 hypotheses from 1,475 datasets.
- The paper, by Mahyar Ghazanfari, Amin Tabrizian, Armin Mehrabian and Peng Wei, was submitted on 2 Jul 2026 and accepted at the ICML 2026 AI for Science Workshop.
- The paper describes applying this pipeline across 1,475 NASA datasets.
EO-Agents is a three-agent LLM pipeline that grounds Earth observation hypothesis generation in the NASA Earth Observation Knowledge Graph; applied to 1,475 NASA datasets, the system produced 160 hypotheses across multiple Earth-science domains. The paper, by Mahyar Ghazanfari, Amin Tabrizian, Armin Mehrabian and Peng Wei, was submitted on 2 Jul 2026 and accepted at the ICML 2026 AI for Science Workshop.
How does EO-Agents work?
EO-Agents first ranks dataset pairings with a heterogeneous graph neural network trained on historical co-usage relations, then runs a three-agent LLM pipeline that filters, generates, and evaluates structured research hypotheses. The pipeline is explicitly grounded in the NASA Earth Observation Knowledge Graph; the authors pair that knowledge-graph grounding with a GNN to surface candidate dataset combinations before applying LLM agents to produce and assess hypotheses.
The paper describes applying this pipeline across 1,475 NASA datasets. The three LLM agents play distinct roles: one filters candidate pairings, another generates structured hypotheses, and the third evaluates those hypotheses, producing an output set intended to be scientifically coherent and tied to data sources rather than only free-form literature claims.
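The flow described above (rank pairings, then filter, generate, and evaluate) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dataset names, the scores, and the three agent functions are hypothetical stand-ins, the GNN is replaced by a lookup table, and no LLM is called.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


@dataclass
class Hypothesis:
    pair: Pair
    statement: str
    score: float = 0.0


def rank_pairs(datasets: List[str],
               link_score: Callable[[str, str], float],
               top_k: int) -> List[Pair]:
    """Stand-in for the GNN step: score every dataset pair, keep the top_k."""
    pairs = list(combinations(datasets, 2))
    pairs.sort(key=lambda p: link_score(p[0], p[1]), reverse=True)
    return pairs[:top_k]


def run_pipeline(datasets, link_score, filter_agent, generate_agent,
                 evaluate_agent, top_k=3):
    """Rank pairs, then apply the filter, generate, and evaluate agents in order."""
    candidates = rank_pairs(datasets, link_score, top_k)
    kept = [p for p in candidates if filter_agent(p)]
    hypotheses = [Hypothesis(p, generate_agent(p)) for p in kept]
    for h in hypotheses:
        h.score = evaluate_agent(h)
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Hypothetical dataset names and made-up co-usage scores (assumptions, not from the paper).
DATASETS = ["soil_moisture", "snow_cover", "aerosol_depth", "leaf_area"]
TOY_SCORES = {
    ("soil_moisture", "leaf_area"): 0.9,
    ("snow_cover", "soil_moisture"): 0.7,
    ("aerosol_depth", "snow_cover"): 0.4,
}


def toy_link_score(a: str, b: str) -> float:
    # Symmetric lookup; unseen pairs score 0.0.
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


results = run_pipeline(
    DATASETS,
    toy_link_score,
    # Each lambda stands in for an LLM agent; the rules are arbitrary toy logic.
    filter_agent=lambda p: "aerosol_depth" not in p,
    generate_agent=lambda p: f"Changes in {p[0]} co-vary with {p[1]}.",
    evaluate_agent=lambda h: 1.0 if "leaf_area" in h.pair else 0.5,
)

for h in results:
    print(f"{h.score:.1f}  {h.statement}")
```

Keeping each agent as a plain callable makes the three roles swappable: a real system would presumably replace the lambdas with model calls and the lookup table with a trained link predictor.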
What did EO-Agents produce and how was it evaluated?
The system produced 160 hypotheses spanning domains including ecohydrology, glaciology, aerosol-cloud interactions, vegetation phenology, and stratospheric chemistry. The authors report that model-predicted novel dataset pairings were rated nearly as plausible as held-out real co-usages drawn from the literature, indicating the pipeline surfaces coherent yet unexplored combinations.
Evaluation included a 2×2×2 factorial experiment across GPT-5.2 and Claude Sonnet 4.6; the authors found that hypothesis rankings remained stable across conditions, while absolute scores depended strongly on judge identity. That result highlights the limits of single-judge LLM evaluation: which model acts as judge strongly affects the absolute score a hypothesis receives, even when the relative ordering holds.
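The difference between stable rankings and judge-dependent absolute scores can be made concrete with a small example. The sketch below uses invented scores for four hypotheses from two hypothetical judges; the numbers are not from the paper, and Spearman rank correlation is used here only as one common way to measure ranking agreement, not as the metric the authors report.

```python
from statistics import mean


def ranks(scores):
    """Return 1-based ranks (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman's rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Invented scores for the same four hypotheses (illustrative only).
judge_a = [7.0, 5.5, 8.0, 6.0]
judge_b = [4.0, 2.5, 5.0, 3.0]  # same ordering, systematically lower

rho = spearman(judge_a, judge_b)
gap = mean(judge_a) - mean(judge_b)

print(f"rank agreement (Spearman rho): {rho:.2f}")
print(f"mean score gap between judges: {gap:.2f}")
```

In this toy case the two judges agree perfectly on order while disagreeing by three points on average, which is the pattern that makes a single judge's absolute scores hard to interpret in isolation.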
Why it matters
Grounding hypothesis generation in a structured Earth-observation knowledge graph, rather than relying only on unstructured literature, shifts the workflow toward dataset-aware scientific ideation. Combining a GNN trained on historical co-usage with LLM agents makes the search for novel dataset pairings systematic and traceable to concrete data sources. The paper also flags an evaluation issue: while rankings were stable, score magnitudes varied with judge identity, which matters for any automated hypothesis-ranking system intended to inform scientists.
What to watch
Look for the ICML 2026 AI for Science Workshop presentation and any companion code or data releases tied to the paper, and for follow-up work that tests the pipeline with additional judge pools or alternative model families to address the reported judge-dependence in absolute scores.
Written by The Brieftide · Source: arXiv