Benchmarks & Evals · June 25, 2026 · 5 min read

WinDOM: Self-Family Distillation for 2B GUI grounding

A 54,425‑record DOM-derived corpus and Self‑Family Distillation push a Qwen3.5‑2B student to +5.4 OOD‑mean on GUI grounding benchmarks.

The Brieftide · June 25, 2026

WinDOM arrives as a focused recipe for improving small (~2B) GUI-grounding models, combining a 54,425-record grounding corpus with a training technique the authors call Self-Family Distillation. The paper was submitted to arXiv on 24 Jun 2026 and presents both the dataset and the distillation method as ways to improve small-model performance without scaling up parameter counts.

What is the WinDOM corpus and how was it gathered?

WinDOM is a 54,425-record grounding corpus harvested by driving an open-source web reimplementation of Windows 11 under headless Playwright. Bounding boxes are read directly off the DOM, so no OCR or human annotation is involved: each record pairs a GUI element with its DOM-derived box, produced automatically rather than by manual labelling.

The collection method offers a practical recipe for producing bounding-box training data for GUI tasks: drive a web reimplementation headlessly, read bounding boxes from the DOM, and record element-context pairs. The paper frames this approach as an alternative to expensive human annotation, aimed at small-model training scenarios such as on-device deployment and accessibility tooling.
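The harvesting loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record schema, the element selector, the viewport size, the use of aria-label or text content as the instruction, and the `url` argument are all assumptions. The only Playwright calls used are the standard sync-API ones (`sync_playwright`, `launch`, `new_page`, `goto`, `screenshot`, `locator(...).all()`, `bounding_box()`).

```python
from typing import Optional


def make_record(label: str, box: Optional[dict], viewport: dict) -> Optional[dict]:
    """Turn one DOM bounding box into a grounding record, or None if unusable.

    `box` has the shape Playwright's Locator.bounding_box() returns:
    {"x", "y", "width", "height"} in CSS pixels, or None if not rendered.
    """
    if box is None or box["width"] <= 0 or box["height"] <= 0:
        return None
    x0, y0 = box["x"], box["y"]
    x1, y1 = x0 + box["width"], y0 + box["height"]
    # Skip elements that extend outside the screenshot area.
    if x0 < 0 or y0 < 0 or x1 > viewport["width"] or y1 > viewport["height"]:
        return None
    return {
        "instruction": label,  # assumed schema, not from the paper
        "bbox": [  # normalised [x0, y0, x1, y1] in the 0..1 range
            round(x0 / viewport["width"], 4),
            round(y0 / viewport["height"], 4),
            round(x1 / viewport["width"], 4),
            round(y1 / viewport["height"], 4),
        ],
    }


def harvest(url: str, selector: str = "button, a, input, [role]") -> list:
    """Hypothetical harvesting loop; requires `pip install playwright`."""
    from playwright.sync_api import sync_playwright  # imported lazily

    viewport = {"width": 1280, "height": 720}  # assumed size
    records = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport=viewport)
        page.goto(url)
        page.screenshot(path="screen.png")  # image paired with the boxes
        for element in page.locator(selector).all():
            label = element.get_attribute("aria-label") or element.text_content() or ""
            record = make_record(label.strip(), element.bounding_box(), viewport)
            if record and record["instruction"]:
                records.append(record)
        browser.close()
    return records
```

Keeping `make_record` separate from the browser loop means the box-filtering logic can be tested without launching a browser.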

What is Self-Family Distillation and how is it applied?

Self-Family Distillation (SFD) is a single rejection-sampling cold-start procedure parameterised only by the choice of teacher: either an exponential moving average (EMA) of the student, which needs no external model, or a frozen, larger teacher from the same model family. The cold-start is then used to initialise reinforcement-learning training with GRPO (Group Relative Policy Optimization), and the method treats the cold-start's saturation depth as an explicit GRPO hyperparameter.
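As an illustration of what a rejection-sampling cold-start could look like for grounding, the sketch below samples several click points from a teacher and keeps only those that land inside the DOM-derived box. The acceptance rule (point inside box), the record fields, and the `teacher` callable are assumptions made for this example; the brief does not describe the paper's actual criterion.

```python
def inside(point, bbox):
    """True if an (x, y) point falls within an [x0, y0, x1, y1] box."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1


def rejection_sample(teacher, example, num_samples=8):
    """Draw teacher predictions and keep only the ones that hit the target.

    `teacher` is a hypothetical callable: instruction -> (x, y) click point.
    Accepted samples become supervised targets for the student's cold-start.
    """
    accepted = []
    for _ in range(num_samples):
        point = teacher(example["instruction"])
        if inside(point, example["bbox"]):
            accepted.append({"instruction": example["instruction"], "target": point})
    return accepted


if __name__ == "__main__":
    # Toy teacher that replays fixed guesses, for demonstration only.
    guesses = iter([(0.5, 0.5), (0.9, 0.9), (0.15, 0.25)])
    example = {"instruction": "Open the Start menu", "bbox": [0.1, 0.2, 0.6, 0.6]}
    kept = rejection_sample(lambda _: next(guesses), example, num_samples=3)
    print(len(kept), "of 3 samples accepted")
```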

SFD therefore provides two practical modes: an EMA-only mode that requires no external teacher, and a cross-size mode that uses a larger same-family teacher (the paper reports a 4B variant as the larger teacher). The authors emphasise the cold-start saturation depth as an adjustable knob rather than a fixed convergence target, and they evaluate how under-saturated versus converged cold-starts affect downstream RL fine-tuning.
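The difference between the two teacher modes can be illustrated with a toy parameter dictionary. In this sketch, which assumes a standard EMA update and an arbitrary decay of 0.9 (the brief gives no decay value), the EMA teacher trails the student as it trains, while a frozen teacher never changes.

```python
def ema_update(teacher, student, decay=0.9):
    """Return teacher weights moved a small step toward the student weights."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


if __name__ == "__main__":
    student = {"w": 0.0}
    ema_teacher = dict(student)   # EMA mode: starts as a copy of the student
    frozen_teacher = {"w": 5.0}   # cross-size mode: fixed, never updated

    for step in range(3):
        student["w"] += 1.0       # stand-in for one student training step
        ema_teacher = ema_update(ema_teacher, student)
        print(step, student["w"], ema_teacher["w"], frozen_teacher["w"])
```

The EMA teacher is a smoothed copy of the student's own history, which is why that mode needs no external model.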

How well does WinDOM plus SFD improve small models?

On a Qwen3.5-2B student, an under-saturated cold-start proved to be a better GRPO initialiser than a converged one. The SFD-4B variant combined with Early-init RL produced a +5.4 OOD-mean improvement over the base Qwen3.5-2B model. The paper breaks that OOD-mean gain into per-benchmark lifts of +3.5 on ScreenSpot-Pro, +7.0 on OSWorld-G, and +5.8 on ScreenSpot-V2.
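For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group normalisation, assuming a binary hit-or-miss grounding reward and a population standard deviation; neither detail is given in the brief. One property worth noting is that a group in which every sample earns the same reward yields all-zero advantages. The brief does not say whether this is related to the under-saturated result.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(group_advantages([1, 0, 0, 0]))  # one hit stands out positively
    print(group_advantages([1, 1, 1, 1]))  # identical rewards: no signal
```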

With the EMA same-size mode in place of an external larger teacher, the reported OOD-mean is 65.2, against 66.3 for the cross-size 4B variant, a gap of 1.1 points. These figures suggest that EMA-only SFD can approach the performance of larger-teacher distillation without requiring an external model.
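The headline numbers in this section are internally consistent, as a quick check shows. The snippet assumes the OOD-mean gain is the unweighted mean of the three per-benchmark lifts, which matches the reported +5.4 but is not stated explicitly in the brief.

```python
# Per-benchmark lifts reported for SFD-4B with Early-init RL over the base model.
lifts = {"ScreenSpot-Pro": 3.5, "OSWorld-G": 7.0, "ScreenSpot-V2": 5.8}

# Assumption: OOD-mean gain is the plain average of the three lifts.
ood_mean_gain = sum(lifts.values()) / len(lifts)

# Gap between the cross-size 4B teacher and the EMA same-size mode.
teacher_gap = 66.3 - 65.2

print(round(ood_mean_gain, 1))  # 5.4
print(round(teacher_gap, 1))    # 1.1
```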

Why it matters

Small (~2B) GUI-grounding agents are attractive for on-device deployment, accessibility tooling, and low-cost iteration, and WinDOM targets those constraints directly. The combination of an automatically harvested 54,425-record DOM-derived corpus and a distillation strategy that can run without external teachers addresses two key bottlenecks for small models: labelled training data and practical fine-tuning recipes. The reported +5.4 OOD-mean and per-benchmark gains show that targeted engineering at the data and cold-start level can improve small-model results without simply scaling parameter counts.

What to watch

Watch for whether the authors release the WinDOM corpus and associated code and whether other small students replicate the reported gains. A concrete next milestone would be broader evaluations of SFD-EMA across different small-model families and public availability of the 54,425-record dataset to let external groups validate the +3.5 to +7.0 per-benchmark improvements described in the paper.

The paper and dataset summary appear on arXiv (submitted 24 Jun 2026) under the title "WinDOM: Self-Family Distillation for Small-Model GUI Grounding."

SFD gains vs OOD-mean comparison

Item	ScreenSpot-Pro	OSWorld-G	ScreenSpot-V2	OOD-mean
SFD-4B with Early-init RL (gain vs base)	+3.5	+7.0	+5.8	+5.4
EMA same-size (reported OOD-mean)	-	-	-	65.2
Cross-size 4B (reported OOD-mean)	-	-	-	66.3

Written by The Brieftide · Source: arXiv
