Multimodal AI · June 25, 2026 · 4 min read

Vision-language models and visual-search behavior: four tests

Farahnaz Wick adapts four classic visual-search paradigms and uses reasoning-token counts as a reaction-time analog to compare VLMs with a human benchmark.

The Brieftide · June 25, 2026

TL;DR

01 Farahnaz Wick adapts four classic visual-search paradigms and uses reasoning-token counts as a reaction-time analog to compare VLMs with a human benchmark.
02 Farahnaz Wick submitted a paper on 23 Jun 2026 that asks whether vision-language models search like humans.
03 The paper frames these paradigms the way psychophysics does: reaction-time scaling with set size distinguishes parallel pop-out from serial, attention-demanding search.

Farahnaz Wick submitted a paper on 23 Jun 2026 that asks whether vision-language models search like humans. She adapts four classic visual-search paradigms and uses the number of reasoning ("thinking") tokens a model spends per trial as a within-model analog of reaction time, then compares model behavior to a large human benchmark (Wolfe et al., 2010).

What did the study test and how?

Wick tested four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry, presenting them to current frontier and mid-tier vision-language models. She treats the count of reasoning tokens per trial as an analog for reaction time because a single model call has no measurable temporal response, and she benchmarks model signatures against a public human dataset (Wolfe et al., 2010).

The paper frames these paradigms the way psychophysics does: reaction-time scaling with set size distinguishes parallel pop-out from serial, attention-demanding search. Wick also includes a resolution control to check whether conjunction costs come from genuine search rather than failure to resolve small shapes.
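The set-size logic described above can be made concrete with a small sketch. It fits a least-squares slope of per-trial effort (reasoning tokens) against set size: a slope near zero is the pop-out signature, and a clearly positive slope is the serial-search signature. The function name `search_slope` and all numbers below are hypothetical and chosen for illustration; they are not data or code from the paper.

```python
from statistics import mean


def search_slope(set_sizes, efforts):
    """Ordinary least-squares slope of effort against set size.

    `efforts` holds one effort value (e.g. mean reasoning tokens)
    per entry in `set_sizes`.
    """
    if len(set_sizes) != len(efforts) or len(set_sizes) < 2:
        raise ValueError("need two or more paired observations")
    mx, my = mean(set_sizes), mean(efforts)
    sxx = sum((x - mx) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(set_sizes, efforts))
    return sxy / sxx


# Made-up token counts for illustration only (NOT results from the paper).
set_sizes = [4, 8, 16, 32]
feature_tokens = [200, 200, 200, 200]      # flat: pop-out-like
conjunction_tokens = [220, 260, 340, 500]  # grows by 10 tokens per item

feature_slope = search_slope(set_sizes, feature_tokens)
conjunction_slope = search_slope(set_sizes, conjunction_tokens)
print(f"feature slope: {feature_slope:.1f} tokens/item")
print(f"conjunction slope: {conjunction_slope:.1f} tokens/item")
```

In this toy setup the slope is expressed in tokens per item, which plays the role that milliseconds per item plays in human reaction-time studies.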

How do VLMs match or diverge from human visual search?

VLMs reproduce several human signatures but also show clear divergences. Specifically, models reproduce the canonical pattern that feature search yields flat effort while conjunction search effort climbs with set size. Frontier models maintain accuracy where mid-tier models collapse to chance. The resolution control indicates the conjunction cost reflects search rather than only shape-resolution difficulty.

At the same time, models diverge in informative ways. Wick finds that in models the target-present effort slope exceeds the target-absent slope, reversing the human ordering. Enumeration remains accurate in models where humans would lose count. One reasoning model that uses adaptive deliberation effectively declines to deliberate on detection tasks. As a result, one VLM expresses difficulty as an effort gradient while another shows an accuracy cliff.

Why it matters

Applying psychophysical paradigms to model behavior exposes both shared mechanisms and important differences. Using the number of reasoning tokens as a reaction-time analog provides a low-cost, within-model measure of search effort that lets researchers map classical human signatures onto machine cognition. The divergences Wick documents, such as the reversed target-present/target-absent slope and preserved enumeration, point to specific ways machine visual processing departs from human attentional dynamics and thus identify precise targets for model analysis or improvement.

What to watch

Look for follow-up work that correlates token-based effort with internal model states or that tests whether the reversed target-present/target-absent ordering holds across more architectures and training regimes. Also watch for experiments that apply the same token-count analog to additional psychophysical tasks beyond the four paradigms Wick adapted.

References and provenance

The analysis and findings come from Farahnaz Wick, arXiv:2606.25066, submitted 23 Jun 2026, "Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms." The human comparison dataset cited is Wolfe et al., 2010.

Human (Wolfe et al., 2010) vs VLMs (Wick 2026) across paradigms

Paradigm	Humans (Wolfe et al., 2010)	VLMs (Wick, 2026)
Feature search	Reaction time flat with set size (pop-out)	Effort flat with set size (reproduced)
Conjunction search	Reaction time climbs with set size (serial search)	Effort climbs with set size; frontier models hold accuracy; mid-tier collapse to chance; resolution control shows genuine search cost
Target-present vs target-absent slope	Target-absent slope exceeds target-present slope	Target-present effort slope exceeds target-absent slope (reversed ordering)
Enumeration	Humans lose count at larger set sizes	Models remain accurate where humans would lose count
Adaptive reasoning behavior	Not applicable	An adaptive deliberation model declines to deliberate on detection tasks, producing either effort gradients or accuracy cliffs across models

Written by The Brieftide · Source: arXiv
