IT-Bench and MAST: IBM and UC Berkeley diagnose agents
IT-Bench and MAST expose common failure modes for enterprise LLM agents and provide targeted tests for connectors, planning, and state.
TL;DR
- IT-Bench and MAST expose common failure modes for enterprise LLM agents and provide targeted tests for connectors, planning, and state.
- IBM and UC Berkeley released IT-Bench and MAST this week, a paired benchmark and analysis suite designed to diagnose why enterprise LLM agents fail in production.
- The two projects target agent behaviours that surface in real-world IT and business workflows, with tests and tooling intended for deployment teams and researchers.
IBM and UC Berkeley released IT-Bench and MAST this week, a paired benchmark and analysis suite designed to diagnose why enterprise LLM agents fail in production. The two projects target agent behaviours that surface in real-world IT and business workflows, with tests and tooling intended for deployment teams and researchers.
IT-Bench is pitched as a benchmark collection that reproduces common enterprise tasks and the environmental conditions that break agents: chained tool use, connector interruptions, state drift across conversations, API error handling, and permission and authentication edge cases. MAST, a modular analysis toolkit, complements the benchmark by instrumenting agent runs, mutating inputs and tool responses, and surfacing the failure modes and causal chains that lead to incorrect or brittle behaviour.
What IT-Bench and MAST test
The projects focus on failure classes that are pervasive in production deployments rather than synthetic academic tasks. Examples enumerated by the teams include connector failures where an external service returns an unexpected schema or intermittent errors, planning failures that create infinite loops or redundant steps, memory and state inconsistency across multi-step dialogues, and tool hallucinations where the agent issues invalid API calls or misuses an integrated service.
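To make the planning-loop failure class concrete, here is a minimal sketch of a repeated-step check. It is not taken from IT-Bench or MAST: the function name `find_repeated_step`, the `max_repeats` threshold, and the `(tool, argument)` step format are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def find_repeated_step(steps: Iterable[Hashable], max_repeats: int = 3) -> Optional[int]:
    """Return the index where a step has occurred more than max_repeats times.

    Returns None if no step exceeds the threshold. This is a hypothetical
    helper, not part of any released tooling.
    """
    seen: Counter = Counter()
    for index, step in enumerate(steps):
        seen[step] += 1
        if seen[step] > max_repeats:
            return index
    return None


# Hypothetical plan trace: each step is a (tool name, argument) pair.
plan = [
    ("lookup_ticket", "T-1"),
    ("fetch_user", "u42"),
    ("lookup_ticket", "T-1"),
    ("lookup_ticket", "T-1"),
    ("lookup_ticket", "T-1"),
]

print(find_repeated_step(plan, max_repeats=3))  # prints 4
```

A planner could call a check like this before each tool invocation and abort or fall back once the threshold is crossed, rather than letting the loop run until a token or time budget is exhausted.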
Both systems simulate realistic enterprise constraints: rate limits, authentication expiries, partial data, and nonstandard document formats. IT-Bench packages scenario-driven tasks that mirror email scheduling, ticket triage, data lookups, and multi-step report generation. MAST provides harnesses to inject faults, replay sessions, and collect detailed traces for each decision point so engineers can pinpoint whether a breakdown began in perception, planning, tool use, or grounding.
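The fault-injection idea can be sketched in a few lines. The wrapper below is an assumption-laden illustration, not MAST's harness: `with_faults`, `InjectedFault`, and `lookup_ticket` are hypothetical names, and failures are scheduled by call number so the run is reproducible.

```python
from typing import Any, Callable, Dict, List, Set, Tuple


class InjectedFault(Exception):
    """Raised in place of a real tool response to simulate an outage."""


def with_faults(tool: Callable[..., Any], fail_on_calls: Set[int]) -> Callable[..., Any]:
    """Wrap a tool so the listed call numbers (1-based) raise InjectedFault."""
    state = {"calls": 0}

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        state["calls"] += 1
        if state["calls"] in fail_on_calls:
            raise InjectedFault(f"injected failure on call {state['calls']}")
        return tool(*args, **kwargs)

    return wrapped


def lookup_ticket(ticket_id: str) -> Dict[str, str]:
    """Hypothetical connector that always succeeds."""
    return {"id": ticket_id, "status": "open"}


# Fail only the second call, then record what the caller observed.
flaky_lookup = with_faults(lookup_ticket, fail_on_calls={2})
trace: List[Tuple[int, str]] = []
for attempt in range(1, 4):
    try:
        flaky_lookup("T-1")
        trace.append((attempt, "ok"))
    except InjectedFault:
        trace.append((attempt, "fault"))

print(trace)  # prints [(1, 'ok'), (2, 'fault'), (3, 'ok')]
```

Scheduling faults by call number rather than at random keeps a failing session replayable, which is the property a trace-and-replay workflow depends on.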
The teams published the code and dataset artifacts on Hugging Face, with instructions for running the benchmark against popular agent frameworks and for adding custom connectors or task templates. The release is framework-agnostic, with adapters intended for standard agent orchestrators and common tool interfaces.
How engineering teams can use them
Deployment teams can use IT-Bench as a preflight test suite to validate an agent against likely production hazards before rollout. MAST functions as a debugging companion during incident response: run a failing trace through MAST to classify the failure and see mutated scenarios that reproduce the problem. Combined, the tools let teams compare architectural choices, for example synchronous versus queued tool invocation, or different planning heuristics, by observing how those choices affect failure rates across the benchmark tasks.
The projects also aim to improve observability primitives for agents. MAST encourages fine-grained telemetry of intermediate plans and tool inputs and outputs. That telemetry helps teams create targeted unit tests for connectors, add retries and backoff strategies for flaky services, and implement explicit state reconciliation steps when memory drift is likely.
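As one example of the retry-and-backoff strategy mentioned above, the following sketch retries a flaky call with exponentially growing delays. It is a generic pattern rather than code from the release; `call_with_backoff`, its parameters, and the choice to retry only on `ConnectionError` are assumptions.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with exponential backoff.

    The delay doubles after each failed attempt. The final failure is
    re-raised so the caller can apply its own fallback policy.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Hypothetical flaky service: fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"


delays: List[float] = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # prints ok [0.5, 1.0]
```

Passing `sleep` as a parameter makes the behaviour testable without real waiting. A production version would typically add random jitter and a cap on the maximum delay, both omitted here for brevity.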
Practical recommendations packaged with the release include hardening connectors with schema validation, isolating side-effecting tools behind wrappers that validate responses, and enforcing explicit abort or fallback policies in planners. The teams provide example instrumentation that maps a failing end-to-end task to the earliest component where behavior deviated from expected values.
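The connector-hardening recommendation might look like the sketch below: a wrapper that checks each tool response against an expected shape before the agent sees it. The names `validated`, `validate`, `SchemaError`, and `TICKET_SCHEMA` are hypothetical, and the field-to-type mapping is a deliberately simple stand-in for a full schema language.

```python
from typing import Any, Callable, Dict, Mapping


class SchemaError(ValueError):
    """Raised when a connector response does not match the expected shape."""


def validate(payload: Any, schema: Mapping[str, type]) -> Dict[str, Any]:
    """Check that payload is a dict with every schema field of the right type."""
    if not isinstance(payload, dict):
        raise SchemaError(f"expected an object, got {type(payload).__name__}")
    for field, expected in schema.items():
        if field not in payload:
            raise SchemaError(f"missing field: {field}")
        if not isinstance(payload[field], expected):
            raise SchemaError(f"field {field!r} should be {expected.__name__}")
    return payload


def validated(tool: Callable[..., Any], schema: Mapping[str, type]) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so its response is validated before being returned."""
    def wrapped(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        return validate(tool(*args, **kwargs), schema)
    return wrapped


# Assumed response shape for a hypothetical ticketing connector.
TICKET_SCHEMA = {"id": str, "priority": int}


def good_connector(ticket_id: str) -> Dict[str, Any]:
    return {"id": ticket_id, "priority": 2}


def drifted_connector(ticket_id: str) -> Dict[str, Any]:
    # Simulates an upstream change: priority is now a string.
    return {"id": ticket_id, "priority": "high"}


safe_good = validated(good_connector, TICKET_SCHEMA)
safe_drifted = validated(drifted_connector, TICKET_SCHEMA)

print(safe_good("T-1"))
try:
    safe_drifted("T-1")
except SchemaError as error:
    print("rejected:", error)
```

Raising a dedicated exception at the wrapper boundary gives the planner a clear signal to abort or fall back, instead of reasoning over a response whose shape has silently changed.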
Why it matters
Enterprise agent failures have outsized operational cost because they interact with live services, confidential data, and business processes. By shifting evaluation from isolated language tasks to scenario-driven, fault-injected benchmarks, IT-Bench and MAST help teams surface brittle integrations and systemic failure modes earlier. That focus reduces deployment risk for organizations building production agent workflows and provides a common vocabulary for researchers and operators to compare reliability improvements.
| Item | Type | Purpose | Focus areas | Primary use |
|---|---|---|---|---|
| IT-Bench | Benchmark collection | Reproduce enterprise tasks and stress conditions | Connector errors, rate limits, multi-step planning failures, state drift | Preflight testing and comparative evaluation |
| MAST | Analysis toolkit | Instrument runs and inject faults to find root causes | Tool misuse, planning loops, mutated responses, telemetry gaps | Debugging, incident analysis, failure-mode classification |
| Other agent test suites | General benchmarks | Measure language or reasoning metrics | Task performance on curated prompts | Academic comparison, model selection, baseline metrics |
Primary source
Hugging Face
huggingface.co