DiscoBench benchmark: AI agents fail to ask clarifying questions

Tencent Hunyuan and Tsinghua’s DiscoBench finds leading models under 50% end-to-end accuracy because they guess instead of asking.

The Brieftide

Tencent Hunyuan and Tsinghua University released DiscoBench on Jul 5, 2026, a benchmark that tests whether AI search agents proactively ask users for clarification when queries are ambiguous. The dataset contains 211 tasks with 463 ambiguous points across eleven knowledge domains and runs queries through the Tavily search engine with a Gemini 3 Flash user simulator.

What is DiscoBench and how does it work?

DiscoBench evaluates whether an agent can detect ambiguity, ask a targeted follow-up, and correct its multi-step research path. Each task is split into multiple checkpoints at which the agent can search, ask, or answer. The benchmark injects four types of ambiguity (entity, temporal/version, criteria/ranking, and factual error), and the user simulator releases a predefined clue when the agent’s follow-up question usefully narrows the search.

The construction pipeline first builds clean multi-hop questions, then injects targeted ambiguities, and finally uses the simulator to hand out clues after useful follow-ups. The dataset is written mostly in Chinese and scores agents across four metric groups, ranging from task success to cost efficiency. This design lets DiscoBench capture how unresolved ambiguity compounds across long reasoning chains rather than testing retrieval in isolation.
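The search, ask, or answer loop described above can be pictured with a toy sketch. Everything here is an illustrative assumption rather than DiscoBench's actual code: the `Checkpoint` fields, the keyword match that stands in for the LLM user simulator, and the two example policies are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (kind, payload); kind is "search", "ask", or "answer"


@dataclass
class Checkpoint:
    ambiguity_type: str  # e.g. "entity", "temporal/version", "criteria/ranking", "factual error"
    clue: str            # predefined clue released after a useful follow-up
    keyword: str         # hypothetical proxy for "this question targets the ambiguity"
    answer: str          # expected answer for this checkpoint


def simulated_user(question: str, cp: Checkpoint) -> str:
    """Toy stand-in for the user simulator: release the clue only for a targeted question."""
    if cp.keyword.lower() in question.lower():
        return cp.clue
    return "I'm not sure what you mean."


def run_checkpoint(policy: Callable[[List[str]], Action], cp: Checkpoint, max_steps: int = 5) -> bool:
    """Let the agent act until it answers; return True if the answer is correct."""
    observations: List[str] = []
    for _ in range(max_steps):
        kind, payload = policy(observations)
        if kind == "search":
            # Placeholder for a real search call.
            observations.append(f"search results for: {payload}")
        elif kind == "ask":
            observations.append(simulated_user(payload, cp))
        elif kind == "answer":
            return payload == cp.answer
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return False  # ran out of steps without answering


cp = Checkpoint(
    ambiguity_type="temporal/version",
    clue="I mean the 2023 edition.",
    keyword="edition",
    answer="2023",
)


def ask_then_answer(observations: List[str]) -> Action:
    """Search once, ask a targeted follow-up, then answer from the clue."""
    if not observations:
        return ("search", "report latest edition")
    if len(observations) == 1:
        return ("ask", "Which edition do you mean?")
    return ("answer", "2023" if "2023" in observations[-1] else "2024")


def direct_guess(observations: List[str]) -> Action:
    """Answer immediately without searching or asking."""
    return ("answer", "2024")


asked = run_checkpoint(ask_then_answer, cp)
guessed = run_checkpoint(direct_guess, cp)
```

In this toy setup the policy that asks receives the clue and answers correctly, while the policy that guesses never sees it.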

How did leading models perform on DiscoBench?

Performance was weak: none of the eleven recent models tested reached 50 percent on full-chain success. Doubao Seed 2.0 Pro posted the highest end-to-end accuracy at 43.1 percent, followed by Gemini 3.1 Pro at 40.8 percent and Claude Opus 4.7 at 39.8 percent, while MiniMax M2.7 and Qwen3.6 Max managed 16.1 and 12.3 percent respectively.

The benchmark separates step-level competence from chain-level success: Claude Opus 4.7 solved 57 percent of checkpoints but reached only 39.8 percent end-to-end, showing that a single unresolved ambiguity can collapse a long search. Guided prompts that explicitly told agents to watch for ambiguity raised average end-to-end accuracy from 28.6 to 33.7 percent across ten models and lifted Detection F1 from 45.3 to 64.9 percent, but the hint alone did not turn better detection into consistently successful research.
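The gap between checkpoint scores and end-to-end scores follows from how the two numbers are aggregated. The sketch below assumes that a task counts as an end-to-end success only when every one of its checkpoints is solved, which is one plausible reading of "full-chain success"; the function names and the toy results are made up for illustration and are not DiscoBench data.

```python
from typing import Dict, List


def checkpoint_accuracy(results: Dict[str, List[bool]]) -> float:
    """Share of individual checkpoints solved, pooled over all tasks."""
    flat = [ok for checks in results.values() for ok in checks]
    return sum(flat) / len(flat)


def end_to_end_accuracy(results: Dict[str, List[bool]]) -> float:
    """Share of tasks in which every checkpoint was solved (assumed definition)."""
    return sum(all(checks) for checks in results.values()) / len(results)


# Toy results: one list of per-checkpoint outcomes per task.
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [True, True, False],
    "task_d": [False, True],
}

step_level = checkpoint_accuracy(toy_results)   # 8 of 11 checkpoints solved
chain_level = end_to_end_accuracy(toy_results)  # 1 of 4 tasks fully solved
```

Here most checkpoints are solved, yet only one task in four survives the whole chain, because a single miss anywhere fails the task.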

What behaviors drive success or failure?

DiscoBench sorts agent behavior into profiles and shows that asking at the right time matters more than searching harder. Agents that searched first and then asked a follow-up (the SearchThenAsk profile) averaged a 93.4 percent success rate, while DirectGuess (guessing without asking) fell to 56.5 percent and SearchHeavyGuess (repeated searches followed by a guess) averaged 51.9 percent. Detection ability and question quality also diverge: Qwen3.6 Max reached a Detection F1 of only 16 percent and asked 0.07 follow-ups per task on average, yet produced high-quality questions when it did ask, while MiniMax M2.7 asked more often but had lower follow-through rates.
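To make the three named profiles concrete, here is a rough sketch of how an action trace might be bucketed. The rules are assumptions inferred from the one-line descriptions above, not the benchmark's published definitions: in particular, the `heavy_search_threshold` of three searches and the catch-all `"Other"` label are invented for illustration.

```python
from typing import List


def classify_profile(actions: List[str], heavy_search_threshold: int = 3) -> str:
    """Bucket an ordered trace of "search" / "ask" / "answer" actions into a profile.

    Assumed rules (hypothetical, for illustration only):
      - SearchThenAsk: at least one search happens before the first ask
      - SearchHeavyGuess: no ask, and at least `heavy_search_threshold` searches
      - DirectGuess: no ask, and fewer searches than the threshold
      - Other: anything else (e.g. asking before any search)
    """
    ask_positions = [i for i, a in enumerate(actions) if a == "ask"]
    search_positions = [i for i, a in enumerate(actions) if a == "search"]

    if ask_positions:
        first_ask = ask_positions[0]
        if any(i < first_ask for i in search_positions):
            return "SearchThenAsk"
        return "Other"

    if len(search_positions) >= heavy_search_threshold:
        return "SearchHeavyGuess"
    return "DirectGuess"


profiles = [
    classify_profile(["search", "ask", "answer"]),
    classify_profile(["answer"]),
    classify_profile(["search", "search", "search", "search", "answer"]),
    classify_profile(["ask", "answer"]),
]
```

Counting how often each label appears across a model's runs would give the kind of behavior share the profile comparison relies on.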

DiscoBench shows that factual errors are easiest for models to detect because they create contradictions during research, while entity and criteria ambiguities are much harder because multiple plausible candidates or unclear evaluation standards can coexist without immediate contradiction. Models collapse without access to search tools: Doubao Seed 2.0 Pro drops from 43.1 to 2.4 percent without tools, and Gemini 3.1 Pro drops from 40.8 to 19.9 percent, underlining that DiscoBench cannot be solved from stored model knowledge alone.

Why does this matter?

DiscoBench shifts attention from retrieval accuracy to interaction strategy: the core failure is not an inability to find facts but a failure to recognize uncertainty and turn it into a clarifying question that moves the chain forward. That matters for any product that chains many lookups together, because a single wrong choice early in the chain can produce a confidently wrong final answer, and more tool calls do not help if they never trigger the right user interaction.

Recent model updates and alternative designs already target this gap: Anthropic’s Claude Opus 4.8 is intended to flag uncertainties more often, and Perplexity’s Search as Code experiments change how workflows are expressed. DiscoBench gives concrete metrics to measure whether those approaches actually raise end-to-end success.

What to watch

Watch whether future releases increase the share of SearchThenAsk behavior, which DiscoBench associates with a 93.4 percent success rate, and whether model updates push end-to-end accuracy above the current top scores (Doubao Seed 2.0 Pro at 43.1 percent, Gemini 3.1 Pro at 40.8 percent, Claude Opus 4.7 at 39.8 percent). Also watch whether systems pair better ambiguity detection with better question quality, so that higher Detection F1 scores translate into higher final-task accuracy.

Written by The Brieftide · Source: The Decoder
