Benchmarks & Evals · 4 min read · via Hugging Face

IT-Bench and MAST: IBM and UC Berkeley diagnose agents

IT-Bench and MAST expose common failure modes for enterprise LLM agents and provide targeted tests for connectors, planning, and state.

The Brieftide

TL;DR

  • IT-Bench and MAST expose common failure modes for enterprise LLM agents and provide targeted tests for connectors, planning, and state.
  • IBM and UC Berkeley released IT-Bench and MAST this week, a paired benchmark and analysis suite designed to diagnose why enterprise LLM agents fail in production.
  • The two projects target agent behaviours that surface in real-world IT and business workflows, with tests and tooling intended for deployment teams and researchers.

IBM and UC Berkeley released IT-Bench and MAST this week, a paired benchmark and analysis suite designed to diagnose why enterprise LLM agents fail in production. The two projects target agent behaviours that surface in real-world IT and business workflows, with tests and tooling intended for deployment teams and researchers.

IT-Bench is pitched as a benchmark collection that reproduces common enterprise tasks and the environmental conditions that break agents: chained tool use, connector interruptions, state drift across conversations, API errors, and permission and authentication edge cases. MAST, described as a modular analysis toolkit, complements the benchmark by instrumenting agent runs, mutating inputs and tool responses, and surfacing the failure modes and causal chains that lead to incorrect or brittle behaviour.

What IT-Bench and MAST test

The projects focus on failure classes that are pervasive in production deployments rather than synthetic academic tasks. Examples enumerated by the teams include connector failures where an external service returns an unexpected schema or intermittent errors, planning failures that create infinite loops or redundant steps, memory and state inconsistency across multi-step dialogues, and tool hallucinations where the agent issues invalid API calls or misuses an integrated service.
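To make the planning-loop failure class concrete, here is a minimal Python sketch of a repeated-step detector. It is not taken from IT-Bench or MAST: the `Step` shape, the `first_repeated_step` helper, and the sample plan are all hypothetical, chosen only to illustrate how redundant or looping steps might be flagged.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical step shape: (tool name, serialized arguments).
Step = Tuple[str, str]


def first_repeated_step(steps: Iterable[Step], max_repeats: int = 2) -> Optional[Step]:
    """Return the first step seen more than max_repeats times, or None."""
    seen: Counter = Counter()
    for step in steps:
        seen[step] += 1
        if seen[step] > max_repeats:
            return step
    return None


# Illustrative plan in which the agent keeps re-issuing the same lookup.
plan = [
    ("lookup_ticket", "id=42"),
    ("fetch_logs", "host=db1"),
    ("lookup_ticket", "id=42"),
    ("lookup_ticket", "id=42"),
]
looping_step = first_repeated_step(plan)
```

A planner could call a check like this after every step and abort or fall back once it returns a non-empty result.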

Both systems simulate realistic enterprise constraints: rate limits, authentication expiries, partial data, and nonstandard document formats. IT-Bench packages scenario-driven tasks that mirror email scheduling, ticket triage, data lookups, and multi-step report generation. MAST provides harnesses to inject faults, replay sessions, and collect detailed traces for each decision point so engineers can pinpoint whether a breakdown began in perception, planning, tool use, or grounding.
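The fault-injection idea can be sketched as a wrapper around a tool call. The Python below is an illustration under stated assumptions, not MAST's actual harness: `with_faults`, `InjectedFault`, and `lookup_user` are hypothetical names, and the two simulated faults (a rate-limit error and a dropped field standing in for partial data) are picked from the constraints listed above.

```python
import random
from typing import Any, Callable, Dict, Optional


class InjectedFault(Exception):
    """Stands in for a simulated service failure (hypothetical type)."""


def with_faults(
    tool: Callable[..., Dict[str, Any]],
    failure_rate: float,
    rng: random.Random,
    drop_field: Optional[str] = None,
) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so calls sometimes fail or return partial data."""

    def wrapped(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated rate limit")
        response = dict(tool(*args, **kwargs))
        if drop_field is not None:
            response.pop(drop_field, None)  # simulate partial data
        return response

    return wrapped


def lookup_user(user_id: str) -> Dict[str, Any]:
    """Toy connector used only for this example."""
    return {"id": user_id, "email": f"{user_id}@example.com"}


# A seeded generator keeps fault sequences reproducible across replays.
flaky_lookup = with_faults(
    lookup_user, failure_rate=0.5, rng=random.Random(0), drop_field="email"
)
```

Passing the random generator in explicitly is a deliberate choice here: a fixed seed lets the same fault sequence be replayed when reproducing a failure.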

The teams published the code and dataset artifacts on Hugging Face, with instructions for running the benchmark against popular agent frameworks and for adding custom connectors or task templates. The release is framework-agnostic, with adapters intended for standard agent orchestrators and common tool interfaces.

How engineering teams can use them

Deployment teams can use IT-Bench as a preflight test suite to validate an agent against likely production hazards before rollout. MAST functions as a debugging companion during incident response: run a failing trace through MAST to classify the failure and see mutated scenarios that reproduce the problem. Combined, the tools let teams compare architectural choices, for example synchronous versus queued tool invocation, or different planning heuristics, by observing how those choices affect failure rates across the benchmark tasks.
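Comparing architectural choices comes down to aggregating pass/fail outcomes per configuration. A minimal Python sketch follows; the `Result` tuple, the `failure_rates` helper, and the sample rows are hypothetical and made up for illustration, not results from IT-Bench.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical result row: (configuration, task, passed).
Result = Tuple[str, str, bool]


def failure_rates(results: Iterable[Result]) -> Dict[str, float]:
    """Compute the fraction of failed tasks for each configuration."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for config, _task, passed in results:
        totals[config] += 1
        if not passed:
            failures[config] += 1
    return {config: failures[config] / totals[config] for config in totals}


# Made-up outcomes for two tool-invocation styles; not real benchmark data.
results = [
    ("sync", "ticket_triage", True),
    ("sync", "report_generation", False),
    ("queued", "ticket_triage", True),
    ("queued", "report_generation", True),
]
rates = failure_rates(results)
```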

The projects also aim to improve observability primitives for agents. MAST encourages fine-grained telemetry of intermediate plans and tool inputs and outputs. That telemetry helps teams create targeted unit tests for connectors, add retries and backoff strategies for flaky services, and implement explicit state reconciliation steps when memory drift is likely.
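One common way to add retries and backoff for a flaky service is exponential backoff, sketched below in Python. This is a generic pattern rather than code from the release: `call_with_backoff` and its parameters are hypothetical, and the choice to retry only on `ConnectionError` is an assumption made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so a unit test can record the delays instead of waiting. Production code would typically add random jitter to each delay so that many clients do not retry in lockstep.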

Practical recommendations packaged with the release include hardening connectors with schema validation, isolating side-effecting tools behind wrappers that validate responses, and enforcing explicit abort or fallback policies in planners. The teams provide example instrumentation that maps a failing end-to-end task to the earliest component where behaviour deviated from what was expected.
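Schema validation and earliest-deviation mapping can be combined in one small check: validate each recorded step against the fields it was expected to produce, then report the first step that fails. The Python below is a sketch under assumptions, not the teams' instrumentation; `TraceStep`, `violates_schema`, `earliest_deviation`, and the sample trace are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TraceStep:
    """One recorded decision point (hypothetical shape)."""
    component: str                    # e.g. "planning", "tool_use", "grounding"
    output: Dict[str, Any]            # what the component actually produced
    expected_types: Dict[str, type]   # required keys and their types


def violates_schema(step: TraceStep) -> bool:
    """True if a required key is missing or has the wrong type."""
    return any(
        key not in step.output or not isinstance(step.output[key], expected)
        for key, expected in step.expected_types.items()
    )


def earliest_deviation(trace: List[TraceStep]) -> Optional[TraceStep]:
    """Return the first step whose output breaks its schema, or None."""
    return next((step for step in trace if violates_schema(step)), None)


# Illustrative trace: the tool call returns a string where an int is expected,
# and the later grounding step fails as a downstream consequence.
trace = [
    TraceStep("planning", {"steps": ["lookup", "reply"]}, {"steps": list}),
    TraceStep("tool_use", {"ticket_id": "42"}, {"ticket_id": int}),
    TraceStep("grounding", {"answer": None}, {"answer": str}),
]
culprit = earliest_deviation(trace)
```

In this trace the grounding step also looks broken, but the check attributes the failure to the tool call that first went wrong.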

Why it matters

Enterprise agent failures carry outsized operational costs because the agents interact with live services, confidential data, and business processes. By shifting evaluation from isolated language tasks to scenario-driven, fault-injected benchmarks, IT-Bench and MAST help teams surface brittle integrations and systemic failure modes earlier. That focus can reduce deployment risk for organizations building production agent workflows, and it gives researchers and operators a common vocabulary for comparing reliability improvements.

IT-Bench vs MAST: focus and usage

Item | Type | Purpose | What it covers | Typical use
IT-Bench | Benchmark collection | Reproduce enterprise tasks and stress conditions | Connector errors, rate limits, multi-step planning failures, state drift | Preflight testing and comparative evaluation
MAST | Analysis toolkit | Instrument runs and inject faults to find root causes | Tool misuse, planning loops, mutated responses, telemetry gaps | Debugging, incident analysis, failure-mode classification
Other agent test suites | General benchmarks | Measure language or reasoning metrics | Task performance on curated prompts | Academic comparison, model selection, baseline metrics

Primary source: Hugging Face (huggingface.co)
