LLMs: gpt-4o, gpt-4.1-mini and claude-sonnet-4.6 study
Analysis of 21,000 multi-turn conversations finds human-like behaviors vary by model and user and can be modulated by system prompts.
TL;DR
- Analysis of 21,000 multi-turn conversations finds human-like behaviors vary by model and user and can be modulated by system prompts.
- Examining Human-Like Behaviors in LLMs, a paper submitted to arXiv on 7 May 2026 by Sunnie S. Y. Kim, Margit Bowler and Leon A Gatys, analyzed 21,000 multi-turn conversations across four widely used models: gpt-4o, gpt-4.1-mini, claude-sonnet-4.6 and gemini-2.5-flash.
Examining Human-Like Behaviors in LLMs, a paper submitted to arXiv on 7 May 2026 by Sunnie S. Y. Kim, Margit Bowler and Leon A Gatys, analyzed 21,000 multi-turn conversations across four widely used models: gpt-4o, gpt-4.1-mini, claude-sonnet-4.6 and gemini-2.5-flash. The study measures how frequently LLMs exhibit human-like behaviors, how those behaviors are judged by people, and how system prompts can change them.
What did the study analyze?
The paper analyzed 21,000 multi-turn conversations drawn from four models, focusing on prevalence, potential effects, and controllability of human-like behaviors. The authors used both an LLM-as-a-judge method and human evaluation to assess behavior types across different conversation goals and user profiles, and they framed results within human-computer interaction and AI research.
The dataset and methods are central to the claim that "human-like behaviors are pervasive." The study contrasts model outputs across the four named models and ties variation to user factors such as conversation goals and user profiles, rather than treating all LLM responses as uniform. The submission is cataloged on arXiv as arXiv:2606.18258 in the cs.HC and cs.AI categories.
How did evaluators judge LLM behaviors?
Human evaluators rated self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, while rating boundary-maintaining behaviors as more appropriate from LLMs than from humans. Those comparative judgments form the paper's key findings about perceived appropriateness.
The study separates behavior types into at least three categories: self-referential expressions, relationship-building moves, and boundary-maintaining refusals or limits. Human evaluation produced a consistent pattern: evaluators judged self-reference and rapport-building as less appropriate coming from an LLM than from a human, but judged boundary-holding as more appropriate coming from an LLM. The paper pairs these human judgments with an automated LLM-as-a-judge pipeline to scale analysis across the 21,000 conversations.
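To make the idea of prevalence concrete, here is a minimal sketch of how judge-assigned labels could be tallied into a per-model share for each behavior category. It is not the authors' pipeline: the category identifiers, the `prevalence_by_model` function, the input shape and the toy labels are all assumptions made for illustration, and the judging step itself is left out.

```python
from collections import Counter, defaultdict

# Hypothetical identifiers paraphrasing the three behavior types named above.
CATEGORIES = ("self_referential", "relationship_building", "boundary_maintaining")


def prevalence_by_model(judged_turns):
    """Return {model: {category: share of judged turns showing that category}}.

    judged_turns is an iterable of (model, labels) pairs, where labels is the
    set of categories a judge assigned to one model turn (possibly empty).
    """
    totals = Counter()            # judged turns per model
    hits = defaultdict(Counter)   # per-model count of turns with each category
    for model, labels in judged_turns:
        totals[model] += 1
        for label in labels:
            if label in CATEGORIES:
                hits[model][label] += 1
    return {
        model: {cat: hits[model][cat] / totals[model] for cat in CATEGORIES}
        for model in totals
    }


# Toy, made-up labels purely for illustration; not data from the paper.
toy = [
    ("model_a", {"self_referential"}),
    ("model_a", set()),
    ("model_b", {"boundary_maintaining", "relationship_building"}),
    ("model_b", {"boundary_maintaining"}),
]
result = prevalence_by_model(toy)
print(result["model_a"]["self_referential"])  # 0.5
```

Keeping the labels as a set per turn lets a single response count toward more than one category, which matters if behaviors co-occur in the same turn.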
Can system prompts control these behaviors?
System prompting can alter the presence of human-like behaviors, but control is not straightforward and requires careful evaluation to avoid unintended effects. The paper demonstrates that prompts change model behavior patterns, yet it warns that interventions demand empirical checks to ensure they do not introduce other issues.
The authors tested system-level prompts as a lever for controllability and found measurable shifts in behavior prevalence across the evaluated conversations. They emphasize the need for rigorous assessment after prompt changes, noting that control succeeds in some dimensions while potentially producing side effects in others.
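One way to picture that kind of post-change check is to compare prevalence with and without a system prompt across every behavior category, not only the one the prompt targets. The sketch below assumes per-category prevalence has already been measured; the function names, the 0.05 tolerance and the numbers are hypothetical and do not come from the paper.

```python
def prevalence_shift(baseline, prompted):
    """Per-category change in prevalence (prompted minus baseline)."""
    return {cat: prompted[cat] - baseline[cat] for cat in baseline}


def side_effects(shift, target, tolerance=0.05):
    """Categories other than the target whose prevalence moved by more than tolerance."""
    return sorted(
        cat for cat, delta in shift.items()
        if cat != target and abs(delta) > tolerance
    )


# Made-up prevalence values for illustration only; not results from the paper.
baseline = {
    "self_referential": 0.40,
    "relationship_building": 0.30,
    "boundary_maintaining": 0.20,
}
prompted = {
    "self_referential": 0.10,
    "relationship_building": 0.28,
    "boundary_maintaining": 0.05,
}

shift = prevalence_shift(baseline, prompted)
flagged = side_effects(shift, target="self_referential")
print(flagged)  # ['boundary_maintaining']
```

In this toy case the prompt reduces the targeted behavior but also suppresses boundary-maintaining responses, the sort of unintended shift that only shows up when all categories are re-measured.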
Why it matters
The findings matter because they quantify an often-discussed phenomenon across a large sample and across multiple deployed model families. By analyzing 21,000 multi-turn conversations and comparing four named models, the paper provides evidence that expectations about appropriate machine behavior differ from expectations about human behavior, and that those differences are model- and user-dependent. Designers, product teams and policy-makers gain a data-grounded basis for choices about when LLMs should act more or less human-like.
What to watch
Look for follow-up work that shares per-model breakdowns and concrete prompt designs, and for evaluations that link behavior changes to user outcomes. The authors state they provide recommendations for responsible LLM design and evaluation; the next milestone will be empirical tests of those recommendations in deployed settings.
Authors and metadata: the paper is titled "Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts," submitted 7 May 2026 by Sunnie S. Y. Kim, Margit Bowler and Leon A Gatys, arXiv:2606.18258, in the Human-Computer Interaction and Artificial Intelligence categories.
Written by The Brieftide · Source: arXiv