Guava paper: Universal harness for embodied manipulation, 4B model
Guava identifies three design ingredients and distills embodied manipulation into a 4B open-source model using fewer than 2K simulated trajectories.
TL;DR
- Guava identifies three design ingredients and distills embodied manipulation into a 4B open-source model using fewer than 2K simulated trajectories.
- Guava is a harness framework that structures how language-capable reasoning models use external perception, planning, and control modules for embodied manipulation.
- The paper identifies three core ingredients for effectiveness: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations.
Guava, a harness framework for embodied tool use, is described in an arXiv paper submitted on 16 June 2026 (arXiv:2606.18363) by Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, and Jiayuan Mao. The paper presents a design-space study and a training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. It reports performance comparable to frontier proprietary models in both simulation and real-world tests.
What is Guava?
Guava is a harness framework that structures how language-capable reasoning models use external perception, planning, and control modules for embodied manipulation. The paper identifies three core ingredients for effectiveness: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. The authors frame Guava as a universal, model-agnostic interface aimed at combining high-level reasoning with dedicated embodied tools rather than relying on end-to-end vision-language-action training.
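To make the three ingredients concrete, here is a minimal sketch of an iterative perception-reasoning-action loop. It is an illustration only, not the paper's implementation: `ToyTabletop`, `scripted_reasoner`, `Observation`, `Action`, and the skill names `pick`, `place`, and `done` are all hypothetical, and the scripted rule set stands in for a real reasoning model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Observation:
    """Multimodal observation: an image reference plus a text scene summary."""
    image_ref: str
    scene_text: str


@dataclass
class Action:
    """Semantic action: a named skill with arguments, not raw joint commands."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


class ToyTabletop:
    """Hypothetical stand-in for the external perception and control tools."""

    def __init__(self) -> None:
        self.held: Optional[str] = None
        self.placed = False

    def perceive(self) -> Observation:
        if self.placed:
            text = "cup is on shelf"
        elif self.held:
            text = f"gripper holds {self.held}"
        else:
            text = "cup is on table"
        return Observation(image_ref="frame.png", scene_text=text)

    def execute(self, action: Action) -> None:
        if action.name == "pick":
            self.held = action.args["object"]
        elif action.name == "place" and self.held:
            self.held = None
            self.placed = True


def scripted_reasoner(instruction: str, obs: Observation) -> Action:
    """Placeholder for the reasoning model: picks the next semantic action."""
    if "on shelf" in obs.scene_text:
        return Action("done")
    if "holds" in obs.scene_text:
        return Action("place", {"target": "shelf"})
    return Action("pick", {"object": "cup"})


def run_episode(
    instruction: str,
    env: ToyTabletop,
    reasoner: Callable[[str, Observation], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Re-perceive before every decision instead of planning once up front."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = env.perceive()                  # perception
        action = reasoner(instruction, obs)   # reasoning
        history.append(action)
        if action.name == "done":
            break
        env.execute(action)                   # action
    return history


trace = run_episode("put the cup on the shelf", ToyTabletop(), scripted_reasoner)
print([a.name for a in trace])
```

Because the reasoner is passed in as a plain callable, any model that can map an instruction and an observation to a semantic action could be swapped in, which is the sense in which such an interface is model-agnostic.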
The framework is developed through a systematic exploration of the design space covering agent workflows, action spaces, and observation spaces. The study asks which workflow and representation choices unlock embodied capabilities across a range of reasoning models, and presents Guava as the set of design decisions that consistently worked best in their experiments.
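A design-space study over three axes can be pictured as a grid of configurations, each of which would be evaluated with every reasoning model. The sketch below only enumerates such a grid. The option names on each axis are hypothetical placeholders, not the variants the paper actually tested.

```python
from itertools import product
from typing import Dict, List

# Hypothetical options per axis; the paper's actual variants are not listed here.
DESIGN_AXES: Dict[str, List[str]] = {
    "workflow": ["single_shot_plan", "iterative_loop"],
    "action_space": ["low_level_pose", "semantic_skill"],
    "observation": ["text_only", "image_only", "image_plus_text"],
}


def enumerate_configs(axes: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every combination of one option per axis."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]


configs = enumerate_configs(DESIGN_AXES)
print(len(configs))  # 2 * 2 * 3 combinations
print(configs[0])
```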
How did the authors train a 4B open-source model with so little data?
The paper describes an end-to-end training pipeline that distills embodied manipulation skills into a compact 4B open-source model using fewer than 2K trajectories, all collected in simulation. That pipeline compresses interaction data and the harnessed tool behaviors into the model so a relatively small parameter count can exhibit strong embodied performance.
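One common way to distill agent behavior into a smaller model is to turn successful trajectories into supervised prompt/target pairs. The sketch below shows that data-preparation step under that assumption; this summary does not say whether the paper uses this exact recipe. The trajectory schema (`instruction`, `success`, `steps`, `observation`, `action`) and the function name `trajectory_to_examples` are hypothetical.

```python
import json
from typing import Dict, List


def trajectory_to_examples(traj: Dict) -> List[Dict[str, str]]:
    """Turn one successful trajectory into per-step (prompt, target) pairs."""
    if not traj["success"]:
        return []  # keep only trajectories that completed the task
    examples = []
    for i, step in enumerate(traj["steps"]):
        # Earlier actions give the student model the same context the loop had.
        prior = [json.dumps(s["action"], sort_keys=True) for s in traj["steps"][:i]]
        prompt = "\n".join([
            f"Instruction: {traj['instruction']}",
            f"Previous actions: {prior}",
            f"Observation: {step['observation']}",
        ])
        target = json.dumps(step["action"], sort_keys=True)
        examples.append({"prompt": prompt, "target": target})
    return examples


# Two made-up simulated trajectories: one success, one failure.
trajectories = [
    {
        "instruction": "put the cup on the shelf",
        "success": True,
        "steps": [
            {"observation": "cup is on table",
             "action": {"name": "pick", "args": {"object": "cup"}}},
            {"observation": "gripper holds cup",
             "action": {"name": "place", "args": {"target": "shelf"}}},
        ],
    },
    {
        "instruction": "open the drawer",
        "success": False,
        "steps": [
            {"observation": "drawer is closed",
             "action": {"name": "pick", "args": {"object": "handle"}}},
        ],
    },
]

examples = [ex for t in trajectories for ex in trajectory_to_examples(t)]
print(len(examples))
print(examples[0]["target"])
```

Filtering on success before building examples is a design choice in this sketch: it keeps failed rollouts out of the training targets at the cost of discarding some data.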
The authors emphasize iterative loops between perception, reasoning, and action, and use semantic action abstractions and multimodal observations as the representational backbone during distillation. Experimental evaluation spans both simulated environments and real-world robot tests, where the distilled 4B model showed performance comparable to frontier proprietary models while generalizing to unseen objects, novel instructions, and long-horizon tasks. The paper positions this approach as evidence that a well-designed harness can yield emergent embodied capabilities in compact open-source models with minimal training data.
How do the experiments support the claim of "effective and universal"?
Guava's empirical claims rest on simulation and real-world experiments in which the harnessed models matched the performance of leading proprietary systems and generalized to new scenarios. The authors report that the distilled 4B model, trained on under 2K simulated trajectories, achieved results comparable to frontier proprietary models and generalized strongly to unseen objects, novel instructions, and long-horizon tasks. The paper uses those tests to argue that the harness design, rather than sheer model scale or dataset size, is a primary enabler of embodied capability.
The study explores variations in agent workflows, action spaces, and observation modalities to isolate what matters. From those experiments the authors extracted the three key ingredients and packaged them as Guava, arguing the principles are effective even for smaller models when combined with a focused distillation pipeline.
Why it matters
Guava challenges the assumption that large-scale end-to-end training alone is necessary for high-performing embodied agents. By showing that a 4B open-source model can be distilled to perform competitively using fewer than 2K simulated trajectories, the work implies that careful system design and interfaces between reasoning models and embodied tools can unlock capabilities efficiently. That lowers the barrier for research groups and practitioners who cannot train huge end-to-end systems but can implement harnessed tool use and distillation pipelines.
What to watch
Watch for replication of the distillation pipeline on other open-source models and for broader benchmarks comparing harnessed compact models against large end-to-end systems. Whether the authors' choices about workflows, action abstractions, and multimodal observations hold up in those settings will show how far Guava's approach generalizes across tasks and robot platforms.
References and provenance: the paper "Guava: An Effective and Universal Harness for Embodied Manipulation" (arXiv:2606.18363) was submitted 16 June 2026 by Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, and Jiayuan Mao.
Written by The Brieftide · Source: arXiv