Benchmarks & Evals · June 17, 2026 · 5 min read

EComAgentBench: 662-task shopping agent benchmark with hidden

A 662-task benchmark splits requirements across visible queries, tool-gated profiles and scripted clarifications.

The Brieftide · June 17, 2026

TL;DR

01 A 662-task benchmark splits requirements across visible queries, tool-gated profiles and scripted clarifications.
02 Agents must uncover hidden intent, verify candidates against attributes and reviews, and pick one product within 100 tool calls.
03 The benchmark models shopping as an investigation rather than a single-query match.

EComAgentBench, a new benchmark from Zeyao Du, Tong Li and Haibo Zhang submitted on 16 Jun 2026, creates 662 shopping tasks that scatter customer requirements across queries, profiles and clarifying interactions. Agents must uncover hidden intent, verify candidates against attributes and reviews, and pick one product within 100 tool calls.

What is EComAgentBench?

EComAgentBench is a 662-task dataset grounded in real Amazon products and reviews that forces agents to handle long-horizon, distributed intent: some requirements are visible in the initial query, some live behind a tool-gated profile, and others appear only through scripted clarification. Construction is automated: every answer is fixed in code before any text is generated, and every sample is validated. Typed, source-tagged rubrics grade every task and attribute each failure to the specific requirement and its source.

The benchmark models shopping as an investigation rather than a single-query match. Each task requires an agent to discover hidden preferences, check candidate products against explicit attributes and review evidence, then commit to a single product decision within a hard cap of 100 tool calls.
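To make the three requirement sources concrete, here is a minimal sketch of how a task with distributed intent could be represented. The class names, field names and the sample task are hypothetical illustrations, not the paper's schema or data.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    """Where a requirement is carried (labels are assumptions for this sketch)."""
    QUERY = "visible_query"
    PROFILE = "tool_gated_profile"
    CLARIFICATION = "scripted_clarification"


@dataclass(frozen=True)
class Requirement:
    text: str
    source: Source


@dataclass
class ShoppingTask:
    task_id: str
    query: str
    requirements: list[Requirement] = field(default_factory=list)

    def visible_at_start(self) -> list[Requirement]:
        # Only query-borne requirements are known before any tool call.
        return [r for r in self.requirements if r.source is Source.QUERY]

    def hidden(self) -> list[Requirement]:
        # Profile and clarification requirements must be discovered.
        return [r for r in self.requirements if r.source is not Source.QUERY]


# Made-up example task, for illustration only.
task = ShoppingTask(
    task_id="demo-001",
    query="I need a quiet blender",
    requirements=[
        Requirement("low noise", Source.QUERY),
        Requirement("budget under 80 USD", Source.PROFILE),
        Requirement("dishwasher-safe jar", Source.CLARIFICATION),
    ],
)
print(len(task.visible_at_start()), len(task.hidden()))
```

In this toy task, one requirement is visible up front and two must be uncovered, which is the kind of gap the benchmark is designed to probe.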

How did models perform on the benchmark?

The paper evaluates seven models. The strongest reaches 57.1% overall accuracy, and rubric satisfaction falls as requirement sources move from visible to hidden: agents do worse when intent originates in the profile or a clarification rather than in the visible query.

The rubric design provides fine-grained attribution for errors. Instead of only scoring final product choice, the benchmark records which requirement an agent missed and whether that requirement came from the visible query, the tool-gated profile or the scripted clarification. This lets researchers see whether failures stem from discovery, verification, or selection steps.
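Source-tagged grading of this kind can be summarized as a satisfaction rate per source. The sketch below assumes each graded rubric item reduces to a (source, satisfied) pair; the function name, the source labels and the sample results are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def satisfaction_by_source(results):
    """Return the fraction of satisfied rubric items for each source.

    `results` is an iterable of (source, satisfied) pairs, where `source`
    is a label such as "query" and `satisfied` is a bool.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for source, satisfied in results:
        totals[source] += 1
        passed[source] += int(satisfied)
    return {source: passed[source] / totals[source] for source in totals}


# Made-up grading results, for illustration only.
graded = [
    ("query", True),
    ("query", True),
    ("profile", True),
    ("profile", False),
    ("clarification", False),
]
rates = satisfaction_by_source(graded)
for source, rate in rates.items():
    print(f"{source}: {rate:.2f}")
```

Comparing the per-source rates side by side is one simple way to see whether an agent's misses cluster in the hidden sources.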

How is the benchmark built and validated?

Construction is automated and deterministic: every answer is fixed in code before any text is generated and every sample is validated. The dataset maps tasks to real Amazon products and review content, and the authors tag rubric items with source metadata so each graded requirement is traceable to the query, profile or clarification that carried it.

This engineering choice aims to make evaluations reproducible and to avoid post-hoc label drift. The tool limit of 100 calls per task constrains agent behavior and mirrors real-system cost or latency concerns while keeping the task scope long enough to require multi-step discovery.
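One way an evaluation harness might enforce a per-task call cap is a thin wrapper around the tool functions. This is a sketch under assumptions: the `BudgetedTools` class, the exception name and the `search` tool are hypothetical, and the paper's actual harness may work differently.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when a task tries to exceed its tool-call cap."""


class BudgetedTools:
    def __init__(self, tools, max_calls=100):
        self._tools = dict(tools)   # tool name -> callable
        self.max_calls = max_calls
        self.calls_used = 0

    def call(self, name, *args, **kwargs):
        # Refuse the call once the budget is spent.
        if self.calls_used >= self.max_calls:
            raise ToolBudgetExceeded(
                f"tool-call cap of {self.max_calls} reached"
            )
        tool = self._tools[name]    # KeyError for unknown tools, not counted
        self.calls_used += 1
        return tool(*args, **kwargs)


# Toy tool and a small cap so the limit is easy to see.
tools = BudgetedTools({"search": lambda q: [f"result for {q}"]}, max_calls=2)
tools.call("search", "blender")
tools.call("search", "quiet blender")
try:
    tools.call("search", "one call too many")
except ToolBudgetExceeded as exc:
    print("stopped:", exc)
```

Counting calls in the wrapper rather than in the agent keeps the limit independent of how any particular agent is implemented.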

Why it matters

EComAgentBench shifts evaluation from single-shot retrieval to a multi-source discovery problem where missing a buried preference breaks the result. The 662-task scale and the rubric-backed, source-tagged grading let teams separate whether agents fail because they did not ask the right clarification, did not consult the profile, or did not verify evidence. That diagnostic capability matters for deploying assistants that must handle incomplete, evolving customer intent across interactions.

The reported top accuracy of 57.1% across seven models shows substantial headroom. The authors also note that "rubric satisfaction degrades from visible to hidden sources," calling attention to the specific weakness of current agents at recovering non-explicit requirements.

What to watch

Watch for benchmark-driven model work that improves hidden-intent discovery, explicit profile use and evidence verification; the next milestone will be models that close the gap between visible and hidden-source rubric satisfaction. Also look for follow-up public code or leaderboards tied to the paper that enable direct comparison against the seven models the authors evaluated.

References and data points from the paper: authors Zeyao Du, Tong Li, Haibo Zhang; submission date 16 Jun 2026; 662 tasks; grounded in Amazon products and reviews; 100 tool call limit; evaluation of seven models with top accuracy 57.1%; rubric satisfaction noted to degrade from visible to hidden sources.

Written by The Brieftide · Source: arXiv

