Benchmarks & Evals · May 11, 2026 · 4 min read · via Microsoft Research

SocialReasoning-Bench finds agents often fail users

Microsoft Research’s SocialReasoning-Bench tests agents across hundreds of scenarios and finds they complete tasks but do not reliably improve the user's position.

The Brieftide


TL;DR

01 Microsoft Research’s SocialReasoning-Bench tests agents across hundreds of scenarios and finds they complete tasks but do not reliably improve the user's position.
02 Microsoft Research released SocialReasoning-Bench and published initial results this week after evaluating dozens of agent configurations across hundreds of simulated scenarios.
03 SocialReasoning-Bench frames agent evaluation around counterfactual user utility.

Microsoft Research released SocialReasoning-Bench and published initial results this week after evaluating dozens of agent configurations across hundreds of simulated scenarios. The benchmark measures whether an agent's actions increase expected user utility, and the team found a recurring pattern: agents complete steps competently but do not reliably improve the user's position.

How SocialReasoning-Bench works

SocialReasoning-Bench frames agent evaluation around counterfactual user utility. For each scenario the benchmark defines a user goal, a starting state, possible actions, and a utility function that captures the user's true interests. Agents are scored by the change in expected utility produced by their actions relative to simple baselines, such as taking no action or following the user's literal instruction.
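The scoring idea described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the benchmark's actual code: the names `expected_utility` and `utility_delta`, the flight-booking scenario, and every number in it are hypothetical. It only shows what "change in expected utility relative to a baseline" means when each policy yields a probability distribution over end states.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Outcomes = List[Tuple[float, State]]      # (probability, resulting state)
Policy = Callable[[State], Outcomes]      # maps a starting state to outcomes
Utility = Callable[[State], float]        # the user's true interests


def expected_utility(outcomes: Outcomes, utility: Utility) -> float:
    """Probability-weighted utility over a policy's possible end states."""
    return sum(p * utility(state) for p, state in outcomes)


def utility_delta(agent: Policy, baseline: Policy,
                  start: State, utility: Utility) -> float:
    """Expected utility of the agent minus that of a counterfactual baseline."""
    return (expected_utility(agent(start), utility)
            - expected_utility(baseline(start), utility))


# Hypothetical scenario: the user says "book the cheapest flight",
# but what they care about is money saved plus attending a meeting.
def utility(state: State) -> float:
    return state["savings"] + 300.0 * state["attended_meeting"]


def no_action(start: State) -> Outcomes:
    # Baseline 1: do nothing, the state is unchanged.
    return [(1.0, dict(start))]


def literal_instruction(start: State) -> Outcomes:
    # Baseline 2: the cheapest flight costs 100 and arrives on time half the time.
    on_time = {"savings": start["savings"] - 100.0, "attended_meeting": 1.0}
    late = {"savings": start["savings"] - 100.0, "attended_meeting": 0.0}
    return [(0.5, on_time), (0.5, late)]


def agent(start: State) -> Outcomes:
    # Agent under test: pays 120 for a flight that reliably arrives on time.
    return [(1.0, {"savings": start["savings"] - 120.0, "attended_meeting": 1.0})]


start = {"savings": 500.0, "attended_meeting": 0.0}
print("vs. no action:          ", utility_delta(agent, no_action, start, utility))
print("vs. literal instruction:", utility_delta(agent, literal_instruction, start, utility))
```

Scoring against both baselines matters in this toy setup: an agent can beat "no action" simply by doing something, while beating "follow the literal instruction" requires it to reason about what the user actually wants.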

The benchmark covers single-turn and multi-step tasks, including planning, tool use, and multi-action interactions where intermediate steps may trade short-term task completion against longer-term user benefit. Evaluations use simulated users and reward models to compute utilities, and the suite compares agent styles: single-shot LLM responses, planner-plus-executor chains, and agents augmented with external tools or retrieval. Microsoft Research ran these tests across models from major vendors and several open-source families to surface cross-model patterns.

Key findings

The dominant finding holds across model classes: many agents execute instructions or complete subgoals but fail to consistently increase the user utility metric. In practice this shows up in three ways: agents follow literal user commands that produce neutral or negative net utility, agents prioritize lower-cost routes that do not advance the user's position, and multi-step planners sometimes pursue internally coherent plans that are misaligned with the underlying utility function.

SocialReasoning-Bench also highlights failure modes that conventional benchmarks focused on task completion or correctness do not capture. An agent can score high accuracy on a checklist of actions while producing little or negative utility change when measured against a counterfactual baseline. The benchmark identifies objective specification errors, poor uncertainty calibration, and insufficient consideration of downstream consequences as recurring contributors to these gaps.
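A toy example makes the gap between checklist accuracy and utility concrete. Everything here is hypothetical and invented for illustration: the action names, the per-action utility values, and the helper functions `checklist_accuracy` and `trace_utility` do not come from the benchmark. The sketch assumes utility can be approximated as a sum of per-action contributions, which is a simplification.

```python
from typing import Dict, List


def checklist_accuracy(trace: List[str], checklist: List[str]) -> float:
    """Fraction of checklist actions that appear in the agent's trace."""
    done = set(trace)
    return sum(action in done for action in checklist) / len(checklist)


def trace_utility(trace: List[str], action_values: Dict[str, float]) -> float:
    """Sum of per-action utility contributions; unlisted actions count as 0."""
    return sum(action_values.get(action, 0.0) for action in trace)


# Hypothetical task: "cancel my subscription". The checklist only records
# whether the expected steps happened.
checklist = ["open_account_page", "find_subscription", "cancel_subscription"]

# Hypothetical utility model: cancelling has some value, but cancelling
# before requesting a refund forfeits money the user was owed.
action_values = {
    "cancel_subscription": 10.0,
    "forfeit_refund": -40.0,
}

agent_trace = [
    "open_account_page",
    "find_subscription",
    "cancel_subscription",
    "forfeit_refund",      # side effect the checklist never looks at
]
baseline_trace: List[str] = []   # counterfactual baseline: take no action

accuracy = checklist_accuracy(agent_trace, checklist)
delta = (trace_utility(agent_trace, action_values)
         - trace_utility(baseline_trace, action_values))

print(f"checklist accuracy: {accuracy:.2f}")   # every expected step was taken
print(f"utility delta:      {delta:+.1f}")     # yet the user ends up worse off
```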

Microsoft Research tested a set of mitigation strategies within the benchmark. Agents that asked clarifying questions, used explicit utility estimation during planning, or incorporated calibrated reward models performed better on the benchmark's utility metric, though improvements were uneven and sensitive to tuning. The team suggests that evaluation against user-centered counterfactuals exposes different failure modes than accuracy-focused tests, and that iterative improvement of objectives and uncertainty handling is necessary.
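One way to combine clarifying questions with explicit utility estimation is a value-of-information rule: ask only when knowing the user's intent would raise expected utility by more than the question costs. The source does not say the tested agents work this way. The sketch below is an assumption-laden illustration, and the intents, payoffs, question cost, and function names are all hypothetical.

```python
from typing import Dict, Tuple

Belief = Dict[str, float]               # intent -> probability
Payoffs = Dict[str, Dict[str, float]]   # action -> intent -> utility


def best_action(belief: Belief, payoffs: Payoffs) -> Tuple[str, float]:
    """Action with the highest expected utility under the current belief."""
    scores = {
        action: sum(belief[intent] * by_intent[intent] for intent in belief)
        for action, by_intent in payoffs.items()
    }
    best = max(scores, key=lambda action: scores[action])
    return best, scores[best]


def value_of_asking(belief: Belief, payoffs: Payoffs) -> float:
    """Expected gain from learning the intent before acting."""
    informed = sum(
        prob * max(by_intent[intent] for by_intent in payoffs.values())
        for intent, prob in belief.items()
    )
    _, uninformed = best_action(belief, payoffs)
    return informed - uninformed


def should_ask(belief: Belief, payoffs: Payoffs, question_cost: float) -> bool:
    """Ask a clarifying question only if it is worth the user's time."""
    return value_of_asking(belief, payoffs) > question_cost


# Hypothetical payoffs for an ambiguous "book me a flight" request.
payoffs = {
    "book_cheapest": {"wants_cheap": 10.0, "wants_fast": -5.0},
    "book_fastest": {"wants_cheap": -2.0, "wants_fast": 10.0},
}

uncertain = {"wants_cheap": 0.5, "wants_fast": 0.5}
confident = {"wants_cheap": 0.9, "wants_fast": 0.1}

print(should_ask(uncertain, payoffs, question_cost=2.0))   # intent is a coin flip
print(should_ask(confident, payoffs, question_cost=2.0))   # intent is mostly known
```

The rule depends on how well the belief probabilities are calibrated: an overconfident belief makes asking look pointless, so the agent skips the question and acts on a guess.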

Why it matters

The findings shift attention from task competence to the actual benefits agents deliver to users, showing that correctness alone is an incomplete proxy for user welfare. Product teams, researchers, and regulators evaluating deployed assistants will need metrics and tests like SocialReasoning-Bench to surface harms and perverse incentives that standard benchmarks miss.

How agents perform on task completion vs. user utility

Item	Single-shot LLM responses	Planner-plus-executor chains	Tool- or retrieval-augmented agents
Task completion	Often high	Often high	High when tools used
Increase in measured user utility	Mixed, inconsistent	Mixed, sometimes negative	Improves with calibration
Asks clarifying questions	Rare	Occasional	More frequent when designed
Sensitivity to objective specification	High	High	Moderate

Primary source

Microsoft Research

microsoft.com
