AA-Briefcase benchmark: Claude Fable 5 leads but fully solves only 3% of tasks
AA-Briefcase runs multi-week knowledge projects from thousands of fragmented files; Anthropic's Claude Fable 5 meets every rubric criterion on just 3 percent of tasks.
TL;DR
- AA-Briefcase runs multi-week knowledge projects from thousands of fragmented files; Anthropic's Claude Fable 5 meets every rubric criterion on just 3 percent of tasks.
- AA-Briefcase, a new benchmark from Artificial Analysis, finds that even the best AI model fully solves just 3 percent of realistic knowledge work tasks.
- AA-Briefcase evaluates models on multi-week knowledge work projects built from thousands of fragmented source files, including Slack threads, emails, meeting transcripts, and large data exports.
AA-Briefcase, a new benchmark from Artificial Analysis, finds that even the best AI model fully solves just 3 percent of realistic knowledge work tasks. The benchmark assembles multi-week projects from thousands of fragmented source files such as Slack threads, emails, meeting transcripts, and large data exports to test how models perform on real knowledge work.
What does AA-Briefcase test?
AA-Briefcase evaluates models on multi-week knowledge work projects built from thousands of fragmented source files, including Slack threads, emails, meeting transcripts, and large data exports. The benchmark recreates the fractured information environment that typifies real knowledge work, then scores models against rubrics that require finding relevant files, synthesizing across sources, and meeting detailed task criteria.
That setup forces models to do more than produce a plausible answer. The tasks demand sustained reading, file selection, cross-referencing and precise execution against rubric items, rather than single-turn question answering.
How do models perform on the tasks?
The top performer, Anthropic's Claude Fable 5, posts the highest rubric pass rate but satisfies every rubric criterion on only 3 percent of tasks. On 31 of 91 tasks (about 34 percent), no model clears 50 percent of the rubric criteria.
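The gap between a high rubric pass rate and a low fully-solved share is easier to see with a small example. The sketch below is illustrative only: the task names and pass/fail values are made up, and pooling criteria across tasks is an assumption, since the article does not say how AA-Briefcase aggregates its scores.

```python
from typing import Dict, List

# Hypothetical results: task name -> pass/fail for each rubric criterion.
# These values are invented for illustration, not AA-Briefcase data.
results: Dict[str, List[bool]] = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, True, False],
    "task_c": [True, False, True, True],
    "task_d": [True, True, False, False],
}


def rubric_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Share of all rubric criteria passed, pooled across tasks (an assumption)."""
    passed = sum(sum(criteria) for criteria in results.values())
    total = sum(len(criteria) for criteria in results.values())
    return passed / total


def fully_solved_share(results: Dict[str, List[bool]]) -> float:
    """Share of tasks on which every rubric criterion passed."""
    solved = sum(all(criteria) for criteria in results.values())
    return solved / len(results)


print(f"Rubric pass rate:   {rubric_pass_rate(results):.0%}")
print(f"Fully solved share: {fully_solved_share(results):.0%}")
```

In this toy data the model passes 75 percent of criteria yet fully solves only one task in four, because a single missed criterion is enough to keep a task out of the fully-solved count.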
Performance breaks down into two broad failure modes. Weaker models tend to miss relevant files or produce unusable outputs, effectively failing basic execution. Stronger models meet the obvious requirements but miss details that only emerge when information from multiple sources is pieced together. The result is a quieter kind of error: deliverables that look acceptable but omit key facts.
AA-Briefcase also highlights a steep price-performance spread. Per-task costs span more than 800x, from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5. That gap separates cheaper, lower-performing models from costlier, higher-scoring ones but does not erase the central finding: even the costliest model fully satisfies the rubric on only 3 percent of tasks.
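As a quick check on the cost spread, the ratio can be recomputed from the two rounded figures quoted above. This is only a back-of-the-envelope sketch: the inputs are the article's rounded values, not the benchmark's exact per-task costs.

```python
# Rounded per-task costs as quoted in the article (USD).
cheapest = 0.04   # DeepSeek V4 Flash, "about $0.04"
priciest = 31.0   # Claude Fable 5, "over $31"

# Ratio of the most expensive to the cheapest per-task cost.
ratio = priciest / cheapest
print(f"Cost ratio from rounded figures: about {ratio:.0f}x")
```

The rounded figures give roughly 775x, so the reported "more than 800x" presumably reflects the unrounded costs, for example a cheapest run slightly under $0.04 or a priciest run above $32.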
Why it matters
The benchmark shows that current models struggle with the sustained, fragmented information work typical of many professional roles. If a top model meets every rubric criterion on only 3 percent of tasks, organizations should expect gaps when relying on models for end-to-end knowledge work that requires assembling evidence across many documents. The cost variation further complicates adoption: higher spending does not buy comprehensive correctness. It narrows some failure modes while leaving cross-document synthesis brittle.
Operational teams and purchasers must therefore separate high-level usefulness from full reliability. Models that produce seemingly correct summaries may omit details that matter for decisions or compliance. The AA-Briefcase results argue for continued human oversight, stronger retrieval and chaining systems, and benchmarks that reflect the messy data environments teams actually face.
What to watch
Watch whether any vendor can materially close the 800x per-task cost gap while improving cross-document synthesis on the 31 tasks where no model reaches 50 percent. Progress would show up as a rising rubric pass rate across tasks that currently defeat all models and a narrowing of the difference between noisy, low-cost outputs and the higher-cost, higher-scoring systems.
AA-Briefcase sets a clear technical target: reduce silent, detail-level failures in multi-source projects and raise the share of tasks that models fully solve well above the current 3 percent figure.
Written by The Brieftide · Source: The Decoder