Multimodal AI · July 2, 2026 · 4 min read

DigitalCoach dataset: 72 coaching sessions, 22,752 dialogue turns

Multimodal dataset contains 28.1 hours of screen and input recordings across five apps and compares model and human coaching behaviors.

The Brieftide · July 2, 2026

TL;DR

01 Multimodal dataset contains 28.1 hours of screen and input recordings across five apps and compares model and human coaching behaviors.
02 The dataset was designed to test whether current agentic models can teach humans how to use software, and to measure differences in communication and grounding between models and people.
03 The dataset aligns utterances with visual context and input traces so researchers can evaluate both conversational behavior and visual grounding in computer use coaching.

DigitalCoach, a multimodal dataset submitted to arXiv on 30 Jun 2026, captures 72 human expert-novice computer use coaching sessions totaling 22,752 dialogue turns and 28.1 hours of synchronized screen and input event recordings across five software applications. The dataset was designed to test whether current agentic models can teach humans how to use software, and to measure differences in communication and grounding between models and people.

What is DigitalCoach?

DigitalCoach is a paired dialogue and interaction corpus: 72 expert-novice coaching sessions, 22,752 dialogue turns, and 28.1 hours of screen and input event recordings spanning five software applications. The dataset aligns utterances with visual context and input traces so researchers can evaluate both conversational behavior and visual grounding in computer use coaching.
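To make the idea of aligning utterances with input traces concrete, here is a minimal sketch in Python. It is not the dataset's actual schema or tooling: the `Utterance` and `InputEvent` records, their field names, the event labels, and the two-second window are all assumptions made for illustration. It shows one simple way to look up the input events that fall near a coach's utterance in time.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical record; the real DigitalCoach fields may differ.
    speaker: str    # e.g. "coach" or "learner"
    start_s: float  # seconds from session start
    end_s: float
    text: str


@dataclass
class InputEvent:
    # Hypothetical record; the real DigitalCoach fields may differ.
    time_s: float   # seconds from session start
    kind: str       # e.g. "click" or "key"
    detail: str


def events_near(utt, events, window_s=2.0):
    """Return events within window_s seconds of the utterance span.

    Assumes `events` is already sorted by time_s.
    """
    times = [e.time_s for e in events]
    lo = bisect_left(times, utt.start_s - window_s)
    hi = bisect_right(times, utt.end_s + window_s)
    return events[lo:hi]


# Toy data, invented for this example.
events = [
    InputEvent(1.0, "click", "File menu"),
    InputEvent(5.5, "click", "Export"),
    InputEvent(12.0, "key", "Enter"),
]
utt = Utterance("coach", 4.0, 6.0, "Now choose Export.")
nearby = events_near(utt, events)
print([e.detail for e in nearby])
```

A time window is only one possible alignment rule. An evaluation of visual grounding would presumably also need the screen recording at that moment, which this sketch leaves out.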

The collection emphasizes real coaching interactions rather than scripted exchanges, producing a multimodal resource that links spoken or typed guidance to the learner's on-screen state and inputs. The arXiv submission lists seven authors: Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, and Amy Pavel.

How do state-of-the-art models compare to human coaches?

Automated evaluation shows a clear behavioral gap: "models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions." When the coaching method is held fixed, models produce utterances similar in style to the human references, yet their outputs are poorly grounded in the visual context captured by the dataset.

Interactive evaluation amplifies those findings. Learners coached by models tend to follow instructions passively rather than engage in deeper problem solving, and the models underperform on visually grounded guidance. The paper frames these differences along two axes: communicative content (direct instruction versus explanation and diagnosis) and grounding quality (utterance relevance to on-screen state and input events).

Why it matters

DigitalCoach exposes where current agentic systems fall short at teaching practical skills. Models that favor direct instructions over explanations and diagnostic checks can produce superficial task completion without building user understanding. Poor visual grounding means an agent might give plausible-sounding guidance that does not map to the learner's actual screen state, increasing user confusion.

Those failings affect anyone building tutoring or in-application assistants: designers need metrics and data that capture explanation frequency, error diagnosis, knowledge checks, and grounding fidelity. DigitalCoach supplies those raw signals by combining dialogue with 28.1 hours of recorded interactions across five applications, enabling targeted improvements to model training and evaluation.
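As a rough illustration of the communicative-content metrics mentioned above, the sketch below computes how often each coaching act appears among a coach's turns. The label names, the `act_rates` helper, and the toy label lists are assumptions made for this example. They are not the paper's annotation scheme or its results.

```python
from collections import Counter

# Hypothetical label set, loosely following the behaviors named in the article.
ACTS = ("instruction", "explanation", "error_diagnosis", "knowledge_check")


def act_rates(labels):
    """Return the share of turns carrying each coaching act."""
    total = len(labels)
    if total == 0:
        return {act: 0.0 for act in ACTS}
    counts = Counter(labels)
    return {act: counts[act] / total for act in ACTS}


# Toy labels invented for illustration, not taken from the dataset.
human_turns = ["instruction", "explanation", "knowledge_check", "error_diagnosis"]
model_turns = ["instruction", "instruction", "instruction", "explanation"]

human_rates = act_rates(human_turns)
model_rates = act_rates(model_turns)
for act in ACTS:
    print(f"{act}: human={human_rates[act]:.2f} model={model_rates[act]:.2f}")
```

Comparing rates rather than raw counts keeps coaches with different numbers of turns on the same scale. Grounding fidelity would need a separate measure tied to screen state.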

What to watch

Look for model evaluations that report improvements on the specific axes DigitalCoach highlights: increases in explanations, error diagnoses, and knowledge-check questions, and measurable gains in visual grounding during interactive tests. Another concrete signal will be follow-up work that uses the dataset to reduce passive user behavior when interacting with model-based coaches.

DigitalCoach lays a foundation for evaluating collaborative and proactive computer use coaching agents, offering a concrete benchmark of 72 sessions and 22,752 turns to measure both what agents say and whether what they say maps to what users actually see and do.

Written by The Brieftide · Source: arXiv
