Large Language Models: profiles are measurement artifacts
An arXiv paper using 56 instruction-tuned LLMs finds directional response bias explains 81–90% of between-model variation.
TL;DR
- An arXiv paper using 56 instruction-tuned LLMs finds directional response bias explains 81–90% of between-model variation.
- Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact, a paper submitted to arXiv on 18 June 2026 by Jelena Meyer, David Garcia and Dirk U. Wulff, finds that commonly used psychological instruments assign stable "profiles" to LLMs that are mostly artifacts of the tests themselves rather than properties of the models.
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact, a paper submitted to arXiv on 18 June 2026 by Jelena Meyer, David Garcia and Dirk U. Wulff, finds that commonly used psychological instruments assign stable "profiles" to LLMs that are mostly artifacts of the tests themselves rather than properties of the models.
The authors administered a battery of personality and risk-preference instruments, including self-reports and behavioral tasks, to 56 instruction-tuned LLMs alongside large human reference samples. Their variance decomposition attributes 81 to 90 percent of between-model variation to a directional response bias, compared with 9 to 16 percent in humans. The paper argues that instruments borrowed from human psychology are rarely orthogonal enough for model assessment and calls for dedicated evaluations centered on response orthogonality.
What did the authors test and find?
They tested 56 instruction-tuned LLMs with a battery of personality and risk-preference instruments and compared the results to large human reference samples. The core finding is that differences between models are driven mostly by a directional response bias rather than by the target traits: models tend to choose one end of a scale or one labeled option regardless of item content.
Beyond the 81–90 percent attribution of between-model variance to bias, the paper reports that the same instruments attribute only 9–16 percent of between-person variance to that bias in humans. The authors also report that the bias declines with model capability but does not disappear, that an instrument's apparent reliability is almost entirely predicted by its response orthogonality, and that the apparent profile a model shows shifts with the items used and can be manufactured through item selection.
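To make the idea of a bias share concrete, here is a minimal sketch on synthetic data. It is not the paper's psychometric framework: the additive response model, the balanced keying, the scale midpoint of 3, and the function name `bias_variance_share` are all assumptions chosen for illustration. Each simulated respondent has a trait level and a directional bias; the bias pushes every answer the same way, while the trait pushes positively and negatively keyed items in opposite directions.

```python
import random
import statistics


def bias_variance_share(bias_sd, trait_sd, n_respondents=200, n_items=20,
                        noise_sd=0.5, seed=0):
    """Share of between-respondent variance attributable to directional bias.

    Assumed toy model (not taken from the paper):
        response = 3 + key * trait + bias + noise
    where key is +1 or -1 and the instrument is balanced (n_items is even).
    """
    rng = random.Random(seed)
    keys = [+1, -1] * (n_items // 2)  # half the items are reverse-keyed

    bias_estimates, trait_estimates = [], []
    for _ in range(n_respondents):
        trait = rng.gauss(0, trait_sd)
        bias = rng.gauss(0, bias_sd)
        responses = [3 + k * trait + bias + rng.gauss(0, noise_sd) for k in keys]

        mean_pos = statistics.mean(r for r, k in zip(responses, keys) if k > 0)
        mean_neg = statistics.mean(r for r, k in zip(responses, keys) if k < 0)

        # With balanced keying the trait cancels in the average of the two
        # means, and the bias cancels in their difference.
        bias_estimates.append((mean_pos + mean_neg) / 2 - 3)
        trait_estimates.append((mean_pos - mean_neg) / 2)

    var_bias = statistics.variance(bias_estimates)
    var_trait = statistics.variance(trait_estimates)
    return var_bias / (var_bias + var_trait)


# Synthetic populations: one where bias varies more than the trait,
# one where the trait varies more than the bias.
print(round(bias_variance_share(bias_sd=1.0, trait_sd=0.3), 2))
print(round(bias_variance_share(bias_sd=0.3, trait_sd=1.0), 2))
```

The two calls use made-up standard deviations and say nothing about real models or people. They only show how a population in which respondents differ mainly in bias yields a high bias share, while one in which they differ mainly in the trait yields a low one.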
How does the paper define and measure the bias and response orthogonality?
Response bias is the tendency to respond toward one end of the scale or one labeled option regardless of item content, and response orthogonality is the proportion of items for which trait and bias point in opposite directions. The paper measures how much of observed variation is attributable to that directional bias versus the intended trait using a formal psychometric framework and variance decomposition.
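The definition of response orthogonality above can be written as a short calculation. This is a sketch under stated assumptions: items are coded +1 when choosing the high end of the scale indicates more of the trait and -1 when it indicates less, and the bias direction is coded the same way. The function name and the coding scheme are illustrative, not taken from the paper.

```python
def response_orthogonality(item_keys, bias_direction):
    """Proportion of items where trait keying and bias point in opposite directions.

    item_keys: sequence of +1 / -1 (hypothetical coding of each item's keying).
    bias_direction: +1 if responses drift toward the high end of the scale,
        -1 if they drift toward the low end.
    """
    if bias_direction not in (1, -1):
        raise ValueError("bias_direction must be +1 or -1")
    if not item_keys:
        raise ValueError("item_keys must not be empty")
    opposed = sum(1 for key in item_keys if key * bias_direction < 0)
    return opposed / len(item_keys)


# An 8-item instrument with two reverse-keyed items, taken by a respondent
# whose bias runs toward the high end of the scale.
keys = [+1, +1, +1, -1, +1, -1, +1, +1]
print(response_orthogonality(keys, bias_direction=+1))  # 0.25
```

Under this coding, an instrument whose items are all keyed in the direction of the bias scores 0, since trait and bias never oppose each other on any item.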
The authors report that an instrument's apparent reliability is almost entirely predicted by its response orthogonality, because orthogonality determines how far bias and trait point the same way across items. Because instruments borrowed from human psychology are rarely orthogonal enough, their validity for LLMs is questionable. The paper also demonstrates that researchers can shift the profile a model appears to have by selecting different sets of items, indicating that profiles are instrument-dependent rather than model-intrinsic.
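The item-selection effect can be illustrated with a toy respondent that has no trait at all. In this sketch, which assumes a 1 to 5 Likert scale and standard reverse-coding of negatively keyed items, the respondent always answers 4 regardless of content. The names `biased_respondent` and `trait_score` are hypothetical, and the setup is a simplification, not the paper's procedure.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point Likert scale


def biased_respondent(n_items, preferred=4):
    """A respondent that ignores item content and always picks the same option."""
    return [preferred] * n_items


def trait_score(responses, item_keys):
    """Mean trait score after reverse-coding negatively keyed items."""
    scored = [
        r if key > 0 else SCALE_MIN + SCALE_MAX - r
        for r, key in zip(responses, item_keys)
    ]
    return sum(scored) / len(scored)


answers = biased_respondent(6)

all_positive = [+1] * 6    # every item keyed toward the high end
all_negative = [-1] * 6    # every item reverse-keyed
balanced = [+1, -1] * 3    # half and half

print(trait_score(answers, all_positive))  # 4.0
print(trait_score(answers, all_negative))  # 2.0
print(trait_score(answers, balanced))      # 3.0
```

The same content-blind answers produce a high, low, or neutral score depending only on which items are included, which is the sense in which a profile can be manufactured through item selection.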
Why it matters
If psychological tests primarily pick up a directional response bias in LLMs, then using these instruments to judge a model's "personality" or safety preferences, or to treat models as proxies for human participants, will produce misleading conclusions. Bias-driven profiles can distort usability and safety assessments, and they undermine research that treats LLMs as stable, human-like respondents. The paper recommends developing assessments that center on response orthogonality to avoid conflating measurement artifacts with model properties.
What to watch
Look for follow-up work that operationalizes response orthogonality into new benchmarks or test batteries, and for papers measuring whether the bias continues to decline with newer, more capable instruction-tuned models. Another concrete signal will be whether instrument designers publish item-level orthogonality metrics or provide item-selection procedures that prevent manufactured profiles.
Bibliographic note: the paper is arXiv:2606.20205, submitted 18 June 2026, and authored by Jelena Meyer, David Garcia and Dirk U. Wulff. The authors provide code, data and media links referenced on the arXiv page alongside the PDF and TeX source.
| Group | Between-respondent variance attributed to directional response bias (%) | Sample |
|---|---|---|
| Instruction-tuned LLMs | 81–90 | 56 models |
| Human reference samples | 9–16 | large human reference samples |
Written by The Brieftide · Source: arXiv