Multimodal AI · 4 min read

Vision-language models and visual-search behavior: four tests

Farahnaz Wick adapts four classic visual-search paradigms and uses reasoning-token counts as a reaction-time analog to compare VLMs with humans.

The Brieftide

TL;DR

  • 01 Farahnaz Wick adapts four classic visual-search paradigms and uses reasoning-token counts as a reaction-time analog to compare VLMs with humans.
  • 02 Farahnaz Wick submitted a paper on 23 Jun 2026 that asks whether vision-language models search like humans.
  • 03 The paper frames these paradigms the way psychophysics does: reaction-time scaling with set size distinguishes parallel pop-out from serial, attention-demanding search.

Farahnaz Wick submitted a paper on 23 Jun 2026 that asks whether vision-language models search like humans. She adapts four classic visual-search paradigms and uses the number of reasoning ("thinking") tokens a model spends per trial as a within-model analog of reaction time, then compares model behavior to a large human benchmark (Wolfe et al., 2010).

What did the study test and how?

Wick tested four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry, presenting them to current frontier and mid-tier vision-language models. She treats the count of reasoning tokens per trial as an analog for reaction time because a single model call has no measurable temporal response, and she benchmarks model signatures against a public human dataset (Wolfe et al., 2010).

The paper frames these paradigms the way psychophysics does: reaction-time scaling with set size distinguishes parallel pop-out from serial, attention-demanding search. Wick also includes a resolution control to check whether conjunction costs come from genuine search rather than failure to resolve small shapes.
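The set-size logic can be made concrete with a small sketch. Assuming effort is summarized as mean reasoning tokens per set size, the slope of a straight-line fit plays the role of the classic reaction-time slope: near zero suggests pop-out, clearly positive suggests serial search. The function name and all numbers below are hypothetical illustrations, not the paper's code or data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical mean reasoning-token counts per set size (NOT data from the paper).
set_sizes = [4, 8, 16, 32]
feature_tokens = [210, 205, 212, 208]      # roughly flat: pop-out-like
conjunction_tokens = [220, 300, 460, 780]  # built to grow by 20 tokens per item

feature_slope, _ = fit_line(set_sizes, feature_tokens)
conjunction_slope, _ = fit_line(set_sizes, conjunction_tokens)

print(f"feature slope:     {feature_slope:.3f} tokens/item")
print(f"conjunction slope: {conjunction_slope:.3f} tokens/item")
```

Because token counts are only comparable within one model, slopes like these are best read as a within-model contrast between paradigms rather than as absolute values to compare across models.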

VLMs reproduce several human signatures but also show clear divergences. Specifically, models reproduce the canonical pattern that feature search yields flat effort while conjunction search effort climbs with set size. Frontier models maintain accuracy where mid-tier models collapse to chance. The resolution control indicates the conjunction cost reflects search rather than only shape-resolution difficulty.
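One way to make "collapse to chance" checkable is an exact one-sided binomial test on trial accuracy. The sketch below assumes a two-alternative present/absent response so that chance is 0.5; the brief does not state the paper's chance level or statistical method, so the threshold, function names, and trial counts are all hypothetical.

```python
from math import comb


def p_at_least(correct, trials, chance=0.5):
    """Exact probability of at least `correct` successes in `trials` under chance."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def above_chance(correct, trials, chance=0.5, alpha=0.05):
    """True if accuracy is significantly above chance (one-sided exact test)."""
    return p_at_least(correct, trials, chance) < alpha


# Hypothetical trial counts (NOT data from the paper).
print(above_chance(90, 100))  # a model that holds accuracy
print(above_chance(52, 100))  # a model indistinguishable from guessing
```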

At the same time, models diverge in informative ways. Wick finds that in models the target-present effort slope exceeds the target-absent slope, reversing the human ordering. Enumeration remains accurate in models where humans would lose count. One reasoning model that uses adaptive deliberation effectively declines to deliberate on detection tasks. As a result, difficulty shows up as an effort gradient in one VLM and as an accuracy cliff in another.
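The slope reversal can be illustrated by fitting target-present and target-absent trials separately and comparing the two slopes. In the sketch below, the human-like series uses an absent slope twice the present slope, a common textbook idealization of self-terminating serial search, while the model-like series is built with the ordering flipped. All numbers and names are hypothetical and are not taken from Wick's paper or from Wolfe et al. (2010).

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_ordering(set_sizes, present, absent):
    """Return (present_slope, absent_slope, label) for one observer."""
    p, a = slope(set_sizes, present), slope(set_sizes, absent)
    if a > p:
        label = "absent > present"
    elif p > a:
        label = "present > absent"
    else:
        label = "equal"
    return p, a, label


set_sizes = [4, 8, 16, 32]

# Hypothetical human-like reaction times in ms (illustrative only).
human = slope_ordering(set_sizes, [500, 600, 800, 1200], [550, 750, 1150, 1950])

# Hypothetical model effort in reasoning tokens (illustrative only).
model = slope_ordering(set_sizes, [200, 360, 680, 1320], [200, 280, 440, 760])

print("human-like:", human)
print("model-like:", model)
```

The units differ between the two observers (milliseconds versus tokens), so only the ordering of the two slopes within each observer is compared, not their magnitudes.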

Why it matters

Applying psychophysical paradigms to model behavior exposes both shared mechanisms and important differences. Using the number of reasoning tokens as a reaction-time analog provides a low-cost, within-model measure of search effort that lets researchers map classical human signatures onto machine cognition. The divergences Wick documents, such as the reversed target-present/target-absent slope and preserved enumeration, point to specific ways machine visual processing departs from human attentional dynamics and thus identify precise targets for model analysis or improvement.

What to watch

Look for follow-up work that correlates token-based effort with internal model states or that tests whether the reversed target-present/target-absent ordering holds across more architectures and training regimes. Also watch for experiments that apply the same token-count analog to additional psychophysical tasks beyond the four paradigms Wick adapted.

References and provenance

The analysis and findings come from Farahnaz Wick, arXiv:2606.25066, submitted 23 Jun 2026, "Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms." The human comparison dataset cited is Wolfe et al., 2010.

Human (Wolfe et al., 2010) vs VLMs (Wick 2026) across paradigms

Item | Humans (Wolfe et al., 2010) | VLMs (Wick 2026)
Feature search | Reaction time flat with set size (pop-out) | Effort flat with set size (reproduced)
Conjunction search | Reaction time climbs with set size (serial search) | Effort climbs with set size; frontier models hold accuracy; mid-tier models collapse to chance; resolution control shows genuine search cost
Target-present vs target-absent slope | Target-absent slope exceeds target-present slope | Target-present effort slope exceeds target-absent slope (reversed ordering)
Enumeration | Humans lose count at larger set sizes | Models remain accurate where humans would lose count
Adaptive reasoning behavior | Not applicable | An adaptive deliberation model declines to deliberate on detection tasks, producing either effort gradients or accuracy cliffs across models

Written by The Brieftide · Source: arXiv
