Vision-language models and visual-search behavior: four tests
Farahnaz Wick adapts four classic visual-search paradigms and uses reasoning-token counts as a reaction-time analog to compare VLMs with humans.
TL;DR
- Farahnaz Wick adapts four classic visual-search paradigms and uses reasoning-token counts as a reaction-time analog to compare VLMs with humans.
- Farahnaz Wick submitted a paper on 23 Jun 2026 that asks whether vision-language models search like humans.
- The paper frames these paradigms the way psychophysics does: reaction-time scaling with set size distinguishes parallel pop-out from serial, attention-demanding search.
Farahnaz Wick submitted a paper on 23 Jun 2026 that asks whether vision-language models search like humans. She adapts four classic visual-search paradigms and uses the number of reasoning ("thinking") tokens a model spends per trial as a within-model analog of reaction time, then compares model behavior to a large human benchmark (Wolfe et al., 2010).
What did the study test and how?
Wick tested four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry, presenting them to current frontier and mid-tier vision-language models. She treats the count of reasoning tokens per trial as an analog for reaction time because a single model call has no measurable temporal response, and she benchmarks model signatures against a public human dataset (Wolfe et al., 2010).
The paper frames these paradigms the way psychophysics does: reaction-time scaling with set size distinguishes parallel pop-out from serial, attention-demanding search. Wick also includes a resolution control to check whether conjunction costs come from genuine search rather than failure to resolve small shapes.
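The set-size logic can be illustrated with a small sketch: fit a least-squares line to effort (reasoning tokens per trial) against set size and read off the slope. A slope near zero resembles pop-out, while a clearly positive slope resembles serial search. The function name `search_slope` and all token counts below are hypothetical values made up for illustration; they are not taken from the paper, and the paper's own fitting procedure may differ.

```python
from statistics import mean

def search_slope(set_sizes, effort):
    """Return (slope, intercept) of an ordinary least-squares line
    fitted to effort (e.g. reasoning tokens per trial) vs set size."""
    if len(set_sizes) != len(effort) or len(set_sizes) < 2:
        raise ValueError("need at least two paired observations")
    mean_x, mean_y = mean(set_sizes), mean(effort)
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, effort))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical mean token counts per set size (illustration only).
set_sizes = [4, 8, 16, 32]
feature_tokens = [120, 118, 123, 121]      # roughly flat
conjunction_tokens = [130, 210, 370, 690]  # grows with set size

feature_slope, _ = search_slope(set_sizes, feature_tokens)
conjunction_slope, _ = search_slope(set_sizes, conjunction_tokens)
print(f"feature: {feature_slope:.2f} tokens/item")
print(f"conjunction: {conjunction_slope:.2f} tokens/item")
```

The slope is expressed in tokens per item, the token-based counterpart of the milliseconds-per-item slopes reported in human search studies.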
How do VLMs match or diverge from human visual search?
VLMs reproduce several human signatures but also show clear divergences. Specifically, models reproduce the canonical pattern that feature search yields flat effort while conjunction search effort climbs with set size. Frontier models maintain accuracy where mid-tier models collapse to chance. The resolution control indicates the conjunction cost reflects search rather than only shape-resolution difficulty.
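One way to make "collapse to chance" concrete is an exact one-sided binomial test on a model's accuracy. The sketch below assumes a two-alternative present/absent response with chance at 0.5; that assumption, the helper name `p_above_chance`, and the example counts are illustrative and do not describe the paper's own analysis.

```python
from math import comb

def p_above_chance(correct, trials, chance=0.5):
    """Exact one-sided binomial p-value: probability of getting at least
    `correct` hits in `trials` attempts if the model were guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical accuracies (illustration only): 52/100 vs 90/100 correct.
near_chance = p_above_chance(52, 100)
well_above = p_above_chance(90, 100)
print(near_chance > 0.05, well_above < 0.05)
```

A large p-value means the accuracy is statistically indistinguishable from guessing, which is the sense in which a model can be said to sit at chance.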
At the same time, models diverge in informative ways. Wick finds that the target-present effort slope exceeds the target-absent slope in models, reversing the human ordering. Enumeration remains accurate in models where humans would lose count. One reasoning model that uses adaptive deliberation effectively declines to deliberate on detection tasks. As a result, difficulty shows up as an effort gradient in one VLM and as an accuracy cliff in another.
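The slope-ordering comparison can be sketched as follows: split trials by whether the target was present, fit an effort-vs-set-size slope to each group, and check which slope is steeper. The trial tuple layout, the function names, and the token counts are assumptions made for this illustration, not data or code from the paper.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slope_ordering(trials):
    """trials: iterable of (set_size, target_present, tokens) tuples.
    Returns (present_slope, absent_slope, label)."""
    present = [(n, t) for n, is_present, t in trials if is_present]
    absent = [(n, t) for n, is_present, t in trials if not is_present]
    present_slope = ols_slope(*zip(*present))
    absent_slope = ols_slope(*zip(*absent))
    if absent_slope > present_slope:
        label = "human-like"   # absent steeper than present
    elif present_slope > absent_slope:
        label = "reversed"     # the ordering the brief reports for models
    else:
        label = "equal"
    return present_slope, absent_slope, label

# Hypothetical per-trial token counts (illustration only).
example_trials = [
    (4, True, 220), (8, True, 340), (16, True, 580),
    (4, False, 148), (8, False, 196), (16, False, 292),
]
print(slope_ordering(example_trials))
```

Keeping the comparison within a single model matters here, since token counts are described as a within-model analog and are not directly comparable across models.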
Why it matters
Applying psychophysical paradigms to model behavior exposes both shared mechanisms and important differences. Using the number of reasoning tokens as a reaction-time analog provides a low-cost, within-model measure of search effort that lets researchers map classical human signatures onto machine cognition. The divergences Wick documents, such as the reversed target-present/target-absent slope and preserved enumeration, point to specific ways machine visual processing departs from human attentional dynamics and thus identify precise targets for model analysis or improvement.
What to watch
Look for follow-up work that correlates token-based effort with internal model states or that tests whether the reversed target-present/target-absent ordering holds across more architectures and training regimes. Also watch for experiments that apply the same token-count analog to additional psychophysical tasks beyond the four paradigms Wick adapted.
References and provenance
The analysis and findings come from Farahnaz Wick, arXiv:2606.25066, submitted 23 Jun 2026, "Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms." The human comparison dataset cited is Wolfe et al., 2010.
| Item | Human signature | VLM behavior |
|---|---|---|
| Feature search | Reaction time flat with set size (pop-out) | Effort flat with set size (reproduced) |
| Conjunction search | Reaction time climbs with set size (serial search) | Effort climbs with set size; frontier models hold accuracy; mid-tier collapse to chance; resolution control shows genuine search cost |
| Target-present vs target-absent slope | Target-absent slope exceeds target-present slope | Target-present effort slope exceeds target-absent slope (reversed ordering) |
| Enumeration | Humans lose count at larger set sizes | Models remain accurate where humans would lose count |
| Adaptive reasoning behavior | Not applicable | One adaptive-deliberation model declines to deliberate on detection tasks; difficulty appears as an effort gradient in one model and as an accuracy cliff in another |
Written by The Brieftide · Source: arXiv