Large Reasoning Models Persist While Humans Disengage
A paper shows large reasoning models (LRMs) spend more tokens on problems they fail, while humans spend less time on them; both track difficulty across items yet diverge within items.
TL;DR
- A paper shows LRMs spend more tokens on problems they fail, while humans spend less time on them; both track difficulty across items yet diverge within items.
- Han-yu Wang's paper, submitted to arXiv on 25 Jun 2026, shows large reasoning models spend more tokens on problems they get wrong while humans spend less time on the same trials.
- Every tested reasoning model shows a large wrong-versus-right token-length effect, with Cohen's d = 1.47–3.13 on H-ARC, while humans show the opposite sign.
Han-yu Wang's paper, submitted to arXiv on 25 Jun 2026, shows large reasoning models spend more tokens on problems they get wrong while humans spend less time on the same trials. Every tested reasoning model shows a large wrong-versus-right token-length effect, with Cohen's d = 1.47–3.13 on H-ARC, while humans show the opposite sign.
What exactly did the study measure?
The study separated two questions. Registration asks how response time or trace length tracks difficulty across items. Allocation asks, holding item identity fixed, whether an agent spends more tokens or time on its own failures or on its own successes. The paper uses a public matched human-LRM corpus and examines five thinking large reasoning models, comparing their token-trace lengths against human response times. It keeps each agent on its own measurement scale and explicitly notes that seconds and tokens are never put on one axis.
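The two questions can be illustrated with a toy sketch. The trial tuples, the item names, and the helper functions `registration` and `allocation` below are hypothetical and invented for illustration; they are not the paper's code or data, and the paper's actual estimators may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trials for one agent: (item_id, trace_length, correct).
# Values are invented for illustration and are not from the paper.
trials = [
    ("easy", 90, True), ("easy", 110, True), ("easy", 140, False),
    ("mid", 200, True), ("mid", 260, False),
    ("hard", 300, True), ("hard", 380, False), ("hard", 360, False),
]

def group_by_item(trials):
    """Collect (length, correct) pairs under each item id."""
    by_item = defaultdict(list)
    for item, length, correct in trials:
        by_item[item].append((length, correct))
    return by_item

def registration(trials):
    """Cross-item view: per-item (error rate, mean length)."""
    return {
        item: (mean(0 if ok else 1 for _, ok in rows), mean(n for n, _ in rows))
        for item, rows in group_by_item(trials).items()
    }

def allocation(trials):
    """Within-item view: mean length on wrong trials minus right trials."""
    gaps = {}
    for item, rows in group_by_item(trials).items():
        wrong = [n for n, ok in rows if not ok]
        right = [n for n, ok in rows if ok]
        if wrong and right:  # both outcomes must occur on the same item
            gaps[item] = mean(wrong) - mean(right)
    return gaps

reg = registration(trials)   # harder items have longer mean traces
gaps = allocation(trials)    # positive gaps: longer on failures
```

In this toy data the agent registers difficulty (mean length rises with error rate) and also allocates more to failures (every within-item gap is positive), which is the pattern the brief attributes to LRMs. Flipping the sign of the gaps while keeping the per-item means rising would give the human-like pattern.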
How do LRMs and humans differ on the same problem?
LRMs and humans agree on registration: both show longer traces or response times on harder items. They diverge on allocation: every LRM shows a large wrong-vs-right effect in which wrong trials are longer, while humans spend less time on trials they get wrong. On the H-ARC dataset the LRM wrong-vs-right effect sizes span Cohen's d = 1.47 to 3.13. A non-thinking baseline shows no such wrong-vs-right effect. The paper interprets the human pattern as "engagement versus abandonment" and reads the LRM pattern as chain growth driven by uncertainty: chains grow when the model is unsure, which coincides with failure.
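For readers unfamiliar with the effect-size measure, here is a minimal sketch of Cohen's d, assuming the common pooled-standard-deviation definition for two independent samples. The brief does not say which variant the paper uses, and the token counts below are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Invented token counts (not data from the paper).
wrong_lengths = [900, 1100, 1000, 1200]
right_lengths = [600, 700, 800, 700]

d = cohens_d(wrong_lengths, right_lengths)  # positive: wrong trials are longer
```

A positive d here means wrong trials are longer than right ones; swapping the argument order flips the sign, which is how an "opposite sign" result would show up for an agent that spends less on its failures.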
Why does this matter?
The finding exposes a measurement blind spot: trace length captures a difficulty signal but not the stopping policy that determines whether time is allocated to likely successes or failures. Two agents can show the same cross-item correlation with difficulty yet implement opposite within-item control. That distinction matters for how researchers interpret longer traces as signs of deliberation. If trace length conflates uncertainty with persistent deliberation, claims about an LRM's humanlike deliberative behavior can be misleading.
What methods support the claim?
The paper holds item identity fixed and reports the dissociation under item fixed effects. The divergence replicates across datasets in the matched corpus. The analysis contrasts five thinking LRMs against human behavior and a non-thinking baseline, documenting consistent cross-item registration but opposite allocation within items. The numeric anchor is the H-ARC Cohen's d range of 1.47 to 3.13 for the LRM wrong-vs-right effect.
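One standard way to hold item identity fixed is to subtract each item's mean before comparing wrong and right trials, which gives the same slope as a regression with item fixed effects. The sketch below assumes that approach; the function name `within_item_slope` and the response times are hypothetical, and the paper's exact model specification is not given in this brief.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical response times in seconds: (item_id, time, correct).
# Invented for illustration; not data from the paper.
trials = [
    ("A", 10.0, True), ("A", 12.0, True), ("A", 8.0, False),
    ("B", 30.0, True), ("B", 24.0, False), ("B", 26.0, False),
]

def naive_gap(trials):
    """Pooled wrong-minus-right mean, ignoring which item each trial came from."""
    wrong = [t for _, t, ok in trials if not ok]
    right = [t for _, t, ok in trials if ok]
    return mean(wrong) - mean(right)

def within_item_slope(trials):
    """Slope of time on a wrong-trial indicator after demeaning both within item."""
    by_item = defaultdict(list)
    for item, t, ok in trials:
        by_item[item].append((t, 0.0 if ok else 1.0))
    num = den = 0.0
    for rows in by_item.values():
        y_bar = mean(y for y, _ in rows)
        x_bar = mean(x for _, x in rows)
        for y, x in rows:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den

pooled = naive_gap(trials)          # positive: hard item B is slower and failed more
within = within_item_slope(trials)  # negative: on the same item, failures are faster
```

In this toy data the pooled comparison and the within-item estimate have opposite signs, because the harder item is both slower and more often failed. That is the confound item fixed effects are meant to remove.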
What to watch
Look for whether future evaluations adopt within-item controls and report allocation effects in addition to cross-item difficulty correlations. A concrete next milestone would be other teams reproducing the H-ARC wrong-vs-right Cohen's d range for additional models or showing the human opposite-sign effect on other matched corpora.
References: Han-yu Wang, "Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation", arXiv:2606.26502, submitted 25 Jun 2026.
| Agent | Within-item allocation (wrong vs. right) | Cross-item registration |
|---|---|---|
| Thinking LRMs (five models) | Positive; Cohen's d = 1.47–3.13 on H-ARC | Reproduces known cross-item alignment with difficulty |
| Humans | Negative sign (spend less time on trials they get wrong) | Reproduces known cross-item alignment with difficulty |
| Non-thinking baseline | Absent (no wrong-vs-right effect) | Does not show the same pattern |
Written by The Brieftide · Source: arXiv