Multimodal AI · June 20, 2026 · 4 min read

Large Language Models: profiles are measurement artifacts

An arXiv paper using 56 instruction-tuned LLMs finds directional response bias explains 81–90% of between-model variation.

The Brieftide · June 20, 2026

TL;DR

01 An arXiv paper using 56 instruction-tuned LLMs finds directional response bias explains 81–90% of between-model variation.
02 The paper, "Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact," submitted to arXiv on 18 June 2026 by Jelena Meyer, David Garcia and Dirk U. Wulff, finds that commonly used psychological instruments assign stable "profiles" to LLMs that are mostly artifacts of the tests themselves rather than properties of the models.

A paper submitted to arXiv on 18 June 2026 by Jelena Meyer, David Garcia and Dirk U. Wulff, "Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact," finds that commonly used psychological instruments assign stable "profiles" to LLMs that are mostly artifacts of the tests themselves rather than properties of the models.

The authors administered a battery of personality and risk-preference instruments, including self-reports and behavioral tasks, to 56 instruction-tuned LLMs alongside large human reference samples. Their variance decomposition attributes 81 to 90 percent of between-model variation to a directional response bias, compared with 9 to 16 percent in humans. The paper argues that instruments borrowed from human psychology are rarely orthogonal enough for model assessment and calls for dedicated evaluations centered on response orthogonality.

What did the authors test and find?

They tested 56 instruction-tuned LLMs with a battery of personality and risk-preference instruments and compared the results to large human reference samples. The core finding is that differences between models are driven mostly by a directional response bias rather than by the target traits: models tend to choose one end of a scale or one labeled option regardless of item content.

Beyond the 81–90 percent attribution of between-model variance to bias, the paper reports that the same instruments attribute only 9–16 percent of between-person variance to that bias in humans. The authors also report that the bias declines with model capability but does not disappear, that an instrument's apparent reliability is almost entirely predicted by its response orthogonality, and that the apparent profile a model shows shifts with the items used and can be manufactured through item selection.
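To make the idea of a variance share concrete, here is a minimal Python sketch on toy data. It assumes each model's observed score is simply the sum of a bias component and a trait component that are uncorrelated across models. That additive model, the function names, and the numbers are illustrative assumptions, not the paper's psychometric framework or its data.

```python
def population_variance(values):
    """Population variance (divide by n), written out to stay dependency-free."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def bias_variance_share(bias_parts, trait_parts):
    """Share of between-model score variance carried by the bias component.

    Assumes observed = bias + trait for each model, with the two parts
    uncorrelated across models (an illustrative simplification).
    """
    observed = [b + t for b, t in zip(bias_parts, trait_parts)]
    return population_variance(bias_parts) / population_variance(observed)


# Toy values for four hypothetical models, chosen by hand so that the
# bias and trait parts have zero covariance.
bias_parts = [-2, -2, 2, 2]
trait_parts = [-1, 1, -1, 1]

share = bias_variance_share(bias_parts, trait_parts)
print(f"bias share of between-model variance: {share:.2f}")
```

With these toy numbers the bias part has variance 4 and the observed scores have variance 5, so the share is 0.8. In real data the two components are not observed separately, which is why a formal framework is needed to estimate them.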

How does the paper define and measure the bias and response orthogonality?

Response bias is the tendency to respond toward one end of the scale or one labeled option regardless of item content, and response orthogonality is the proportion of items for which trait and bias point in opposite directions. The paper measures how much of observed variation is attributable to that directional bias versus the intended trait using a formal psychometric framework and variance decomposition.
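The definition of response orthogonality above can be sketched in a few lines of Python. This is a simplified reading, not the authors' code: it assumes each item has a known keying (+1 if the high end of the scale indicates more of the trait, -1 if reverse-keyed) and that the respondent's bias has a single direction across the instrument. The function name and the example scale are hypothetical.

```python
from typing import Sequence


def response_orthogonality(trait_keys: Sequence[int], bias_direction: int) -> float:
    """Proportion of items on which trait keying and bias point in opposite directions.

    trait_keys: +1 where the high end of the scale signals more of the trait,
                -1 where the item is reverse-keyed.
    bias_direction: +1 if responses drift toward the high end, -1 toward the low end.
    """
    if bias_direction not in (1, -1):
        raise ValueError("bias_direction must be +1 or -1")
    if not trait_keys:
        raise ValueError("trait_keys must not be empty")
    opposed = sum(1 for key in trait_keys if key * bias_direction < 0)
    return opposed / len(trait_keys)


# Hypothetical 8-item scale with two reverse-keyed items and a bias
# toward the high end of the response scale.
keys = [1, 1, 1, 1, 1, 1, -1, -1]
print(response_orthogonality(keys, bias_direction=1))  # 0.25
```

Under this reading, a scale with no reverse-keyed items scores 0 for a respondent biased toward the high end, meaning bias and trait are fully confounded on every item.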

The authors find that instruments with higher response orthogonality appear more reliable because orthogonality reduces the extent to which bias and trait align. Because most human psychometric instruments are not fully orthogonal, their validity for LLMs is questionable. The paper demonstrates that by selecting different sets of items researchers can shift the profile a model appears to have, indicating that profiles are instrument-dependent rather than model-intrinsic.
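The item-selection effect can be illustrated with a deliberately extreme toy case: a respondent with no trait signal at all that always picks the top option. The sketch below assumes a 1-to-5 Likert scale with standard reverse scoring. The item pool, the helper names, and the pure-bias respondent are hypothetical and are not taken from the paper.

```python
def scale_score(responses, keys, scale_max=5):
    """Mean trait score on a 1..scale_max Likert scale.

    keys: +1 for forward-keyed items, -1 for reverse-keyed items,
    which are flipped before averaging (standard reverse scoring).
    """
    scored = [r if k == 1 else scale_max + 1 - r for r, k in zip(responses, keys)]
    return sum(scored) / len(scored)


# Hypothetical item pool: six forward-keyed and six reverse-keyed items.
pool_keys = [1] * 6 + [-1] * 6
# A pure-bias respondent picks the top option on every item.
responses = [5] * len(pool_keys)


def score_subset(indices):
    """Score the respondent using only the selected items from the pool."""
    idx = list(indices)
    return scale_score([responses[i] for i in idx], [pool_keys[i] for i in idx])


forward_only = score_subset(range(0, 6))    # only forward-keyed items
reverse_only = score_subset(range(6, 12))   # only reverse-keyed items
balanced = score_subset(range(0, 12))       # the full, balanced pool
print(forward_only, reverse_only, balanced)  # 5.0 1.0 3.0
```

The same set of answers reads as a maximal trait score, a minimal one, or a midpoint depending only on which items are scored, which is the sense in which a profile can be manufactured.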

Why it matters

If psychological tests primarily pick up a directional response bias in LLMs, then using these instruments to judge a model's "personality" or safety preferences, or treating LLMs as proxies for human participants, will produce misleading conclusions. Bias-driven profiles can distort usability and safety assessments, and they undermine research that treats LLMs as stable, human-like respondents. The paper recommends developing assessments that center on response orthogonality to avoid conflating measurement artifacts with model properties.

What to watch

Look for follow-up work that operationalizes response orthogonality into new benchmarks or test batteries, and for papers measuring whether the bias continues to decline with newer, more capable instruction-tuned models. Another concrete signal will be whether instrument designers publish item-level orthogonality metrics or provide item-selection procedures that prevent manufactured profiles.

Bibliographic note: the paper is arXiv:2606.20205, submitted 18 June 2026, and authored by Jelena Meyer, David Garcia and Dirk U. Wulff. The authors provide code, data and media links referenced on the arXiv page alongside the PDF and TeX source.

Variance attributed to directional response bias

Group	Variance attributed to bias (%)	Sample
Instruction-tuned LLMs	81–90	56 models
Human reference samples	9–16	Large human reference samples

Written by The Brieftide · Source: arXiv
