Multimodal AI · June 18, 2026 · 5 min read

Guava paper: Universal harness for embodied manipulation, 4B model

Guava identifies three design ingredients and distills embodied manipulation into a 4B open-source model using fewer than 2K simulated trajectories.

The Brieftide · June 18, 2026

TL;DR

01 Guava identifies three design ingredients and distills embodied manipulation into a 4B open-source model using fewer than 2K simulated trajectories.
02 Guava is a harness framework that structures how language-capable reasoning models use external perception, planning, and control modules for embodied manipulation.
03 The paper identifies three core ingredients for effectiveness: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations.

Guava, a harness framework for embodied tool use, appears as an arXiv paper submitted on 16 June 2026 (arXiv:2606.18363) by Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, and Jiayuan Mao. The paper presents a design-space study and a training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation, and reports comparable performance to frontier proprietary models in both simulation and real-world tests.

What is Guava?

Guava is a harness framework that structures how language-capable reasoning models use external perception, planning, and control modules for embodied manipulation. The paper identifies three core ingredients for effectiveness: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. The authors frame Guava as a universal, model-agnostic interface aimed at combining high-level reasoning with dedicated embodied tools rather than relying on end-to-end vision-language-action training.
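To make the three ingredients concrete, here is a minimal Python sketch of an iterative perception-reasoning-action loop that passes multimodal observations to a policy and receives semantic, object-level actions back. It is not Guava's actual interface: `ToySim`, `scripted_policy`, `run_harness`, and the `pick`/`place` action names are all hypothetical, and a hand-written rule stands in for the reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Observation:
    """Multimodal observation: an image handle plus a text scene summary."""
    image_id: str
    scene_text: str


@dataclass
class SemanticAction:
    """Object-level action such as pick 'red block', not joint commands."""
    name: str
    target: str


class ToySim:
    """Hypothetical stand-in environment, not the paper's simulator."""

    def __init__(self) -> None:
        self.location: Dict[str, str] = {"red block": "table", "blue cup": "table"}
        self.holding: Optional[str] = None
        self.frames = 0

    def observe(self) -> Observation:
        # Perception step: report where every object is and what is held.
        self.frames += 1
        parts = [f"{obj} at {loc}" for obj, loc in sorted(self.location.items())]
        held = self.holding or "nothing"
        return Observation(f"frame_{self.frames}", "; ".join(parts) + f"; holding {held}")

    def execute(self, action: SemanticAction) -> bool:
        # Control step: a real harness would call planning and control tools here.
        if (action.name == "pick" and self.holding is None
                and self.location.get(action.target) == "table"):
            self.holding = action.target
            self.location[action.target] = "gripper"
            return True
        if action.name == "place" and self.holding is not None:
            self.location[self.holding] = action.target
            self.holding = None
            return True
        return False


def scripted_policy(instruction: str, obs: Observation) -> Optional[SemanticAction]:
    """Assumed stand-in for the reasoning model; returns None when done."""
    if "red block at bin" in obs.scene_text:
        return None
    if "holding red block" in obs.scene_text:
        return SemanticAction("place", "bin")
    return SemanticAction("pick", "red block")


Policy = Callable[[str, Observation], Optional[SemanticAction]]


def run_harness(env: ToySim, policy: Policy, instruction: str,
                max_steps: int = 10) -> List[Tuple[str, str, bool]]:
    """Re-observe after every action so the policy reasons over fresh state."""
    trace: List[Tuple[str, str, bool]] = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy(instruction, obs)
        if action is None:
            break
        ok = env.execute(action)
        trace.append((action.name, action.target, ok))
    return trace


if __name__ == "__main__":
    print(run_harness(ToySim(), scripted_policy, "put the red block in the bin"))
```

The point of the sketch is the shape of the loop: the policy never sees motor commands, only a fresh observation each turn, and it answers with an action that names an object. Swapping `scripted_policy` for a call to a reasoning model would leave `run_harness` unchanged, which is the sense in which such an interface can be model-agnostic.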

The framework is developed through a systematic exploration of the design space covering agent workflows, action spaces, and observation spaces. The study asks which workflow and representation choices unlock embodied capabilities across a range of reasoning models, and presents Guava as the set of design decisions that consistently worked best in their experiments.

How did the authors train a 4B open-source model with so little data?

The paper describes an end-to-end training pipeline that distills embodied manipulation skills into a compact 4B open-source model using fewer than 2K trajectories, all collected in simulation. That pipeline compresses interaction data and the harnessed tool behaviors into the model so a relatively small parameter count can exhibit strong embodied performance.

The authors emphasize iterative loops between perception, reasoning, and action, and use semantic action abstractions and multimodal observations as the representational backbone during distillation. Experimental evaluation spans both simulated environments and real-world robot tests, where the distilled 4B model showed performance comparable to frontier proprietary models while generalizing to unseen objects, novel instructions, and long-horizon tasks. The paper positions this approach as evidence that a well-designed harness can yield emergent embodied capabilities in compact open-source models with minimal training data.
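One plausible way to turn simulated trajectories into training data for a compact model is sketched below. This brief does not describe the paper's actual data format or filtering rules, so everything here is an assumption: the `Step` and `Trajectory` types, the choice to keep only successful episodes, the prompt layout, and the `max_trajectories` cap (set to 2000 only to echo the "fewer than 2K" figure).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str  # text rendering of a multimodal observation
    action: str       # semantic action chosen by the harnessed teacher


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step]
    success: bool


def build_sft_examples(trajectories: List[Trajectory],
                       max_trajectories: int = 2000) -> List[Tuple[str, str]]:
    """Flatten trajectories into (prompt, target_action) pairs.

    Assumption: only successful episodes are kept, and each step is
    conditioned on the instruction plus the full interaction history.
    """
    kept = [t for t in trajectories if t.success][:max_trajectories]
    examples: List[Tuple[str, str]] = []
    for traj in kept:
        history: List[str] = []
        for step in traj.steps:
            prompt_lines = [f"Instruction: {traj.instruction}", *history,
                            f"Observation: {step.observation}", "Action:"]
            examples.append(("\n".join(prompt_lines), step.action))
            # Later steps see earlier observations and actions as context.
            history.append(f"Observation: {step.observation}")
            history.append(f"Action: {step.action}")
    return examples


if __name__ == "__main__":
    demo = Trajectory(
        instruction="put the red block in the bin",
        steps=[Step("red block at table", "pick red block"),
               Step("holding red block", "place bin")],
        success=True,
    )
    for prompt, target in build_sft_examples([demo]):
        print(prompt, "->", target)
```

Each pair could then feed a standard supervised fine-tuning run on the student model. Conditioning on the running history is a design choice that mirrors the iterative loop the harness uses at inference time.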

How do the experiments support the claim of "effective and universal"?

Guava's empirical claims rest on simulation and real-world experiments where the harnessed models matched the performance of leading proprietary systems and generalized across new scenarios. The authors report that the distilled 4B model, trained on under 2K simulated trajectories, achieved comparable results to frontier proprietary models and demonstrated strong generalization to unseen objects, novel instructions, and long-horizon tasks. The paper uses those cross-domain tests to argue the harness design, rather than sheer model scale or dataset size, is a primary enabler of embodied capability.

The study explores variations in agent workflows, action spaces, and observation modalities to isolate what matters. From those experiments the authors extracted the three key ingredients and packaged them as Guava, arguing the principles are effective even for smaller models when combined with a focused distillation pipeline.
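A design-space study of this kind can be pictured as a grid over the three axes named above. The sketch below enumerates such a grid and ranks configurations with a caller-supplied scoring function. The option names in `DESIGN_SPACE` are placeholders chosen for illustration, not the variants the paper actually tested, and no scores are built in.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

# Placeholder options per axis; the paper's actual variants are not listed here.
DESIGN_SPACE: Dict[str, List[str]] = {
    "workflow": ["single_shot", "iterative_loop"],
    "action_space": ["low_level", "semantic"],
    "observation": ["text_only", "multimodal"],
}


def enumerate_configs(space: Dict[str, List[str]]) -> List[Config]:
    """Return every combination of one option per axis."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]


def rank_configs(configs: List[Config],
                 evaluate: Callable[[Config], float]) -> List[Tuple[float, Config]]:
    """Score each configuration and sort from best to worst.

    `evaluate` is supplied by the caller, for example a function that runs
    a benchmark and returns a success rate.
    """
    scored = [(evaluate(c), c) for c in configs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for cfg in enumerate_configs(DESIGN_SPACE):
        print(cfg)
```

With two options on each of three axes, the grid has eight configurations. Holding two axes fixed while varying the third is what lets a study attribute a change in performance to a single design choice.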

Why it matters

Guava challenges the assumption that large-scale end-to-end training alone is necessary for high-performing embodied agents. By showing that a 4B open-source model can be distilled to perform competitively using fewer than 2K simulated trajectories, the work implies that careful system design and interfaces between reasoning models and embodied tools can unlock capabilities efficiently. That lowers the barrier for research groups and practitioners who cannot train huge end-to-end systems but can implement harnessed tool use and distillation pipelines.

What to watch

Watch for replication of the distillation pipeline on other open-source models and for broader benchmarks comparing harnessed compact models against large end-to-end systems. Whether Guava’s approach generalizes across tasks and robot platforms will depend on how well the authors’ choices about workflows, action abstractions, and multimodal observations hold up in those settings.

References and provenance: the paper "Guava: An Effective and Universal Harness for Embodied Manipulation" (arXiv:2606.18363) was submitted 16 June 2026 by Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, and Jiayuan Mao.

Figure: Guava harness architecture and distillation pipeline

Written by The Brieftide · Source: arXiv

