VAKRA benchmark: IBM tests agent reasoning, tool failures
IBM Research's VAKRA benchmark evaluates agent reasoning, tool use and common failure modes across LLM-driven agents.
TL;DR
- IBM Research's VAKRA benchmark evaluates agent reasoning, tool use and common failure modes across LLM-driven agents.
- IBM Research published VAKRA, a benchmark designed to evaluate reasoning, tool use, and failure modes in agent systems built on large language models.
- The suite exercises planning, API and tool selection, multi-step tool invocation, and recovery from errors at scale to reveal how contemporary agent designs succeed and fail in realistic workflows.
IBM Research published VAKRA, a benchmark designed to evaluate reasoning, tool use, and failure modes in agent systems built on large language models. The suite exercises planning, API and tool selection, multi-step tool invocation, and recovery from errors at scale to reveal how contemporary agent designs succeed and fail in realistic workflows.
VAKRA assembles task families that require agents to call external tools, construct multi-step plans, and adapt to partial or changing information. The benchmark mixes synthetic tasks with grounded, API-driven scenarios to test both reasoning and integration behavior. Implementations tested include agent frameworks wired to open-source LLMs, models behind proprietary APIs, and hybrid setups that combine chain-of-thought prompting with external tool wrappers.
Benchmark design and tested agents
The benchmark groups tasks by capability: single-call tool use, multi-step tool orchestration, parameterized API calls, and error handling when tools return unexpected results. Tasks vary by domain, covering data lookups, code execution, document retrieval and controlled environment manipulation. Scenarios include deterministic steps where the correct sequence is known, and open-ended planning where multiple valid approaches exist.
VAKRA evaluates agents on objective signals: whether the agent issues the correct tool call, whether it supplies valid parameters, whether it completes required steps in order, and whether it recovers or falls back on failure. The benchmark also captures intermediate reasoning traces when available, enabling analysis of how internal plans map to external actions. IBM ran VAKRA across a selection of agent architectures, comparing open-source LLMs against closed API models and simple rule-based or scripted baselines.
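To make the objective signals above concrete, here is a minimal sketch of how one attempt could be scored against a reference call sequence. It is not VAKRA's scoring code: the `ToolCall` shape, the `score_trace` function, and the tool names are hypothetical, and the sketch assumes each task ships with one known-good sequence, which holds only for the deterministic scenarios.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation: a tool name plus keyword arguments (hypothetical shape)."""
    tool: str
    params: dict = field(default_factory=dict)


def score_trace(expected: list[ToolCall], actual: list[ToolCall], recovered: bool) -> dict:
    """Return four boolean signals for a single task attempt."""
    expected_tools = [c.tool for c in expected]
    actual_tools = [c.tool for c in actual]

    # Correct tool call: every expected tool was invoked at least once.
    correct_tools = set(expected_tools) <= set(actual_tools)

    # Valid parameters: every expected call appears with identical parameters.
    valid_params = all(
        any(a.tool == e.tool and a.params == e.params for a in actual)
        for e in expected
    )

    # Required order: expected tools form a subsequence of the actual tools.
    remaining = iter(actual_tools)
    in_order = all(tool in remaining for tool in expected_tools)

    return {
        "correct_tools": correct_tools,
        "valid_params": valid_params,
        "in_order": in_order,
        "recovered": recovered,  # supplied by the harness, not inferred here
    }


# Hypothetical task: search first, then fetch. The agent swapped the order.
expected = [ToolCall("search", {"q": "x"}), ToolCall("fetch", {"id": 1})]
actual = [ToolCall("fetch", {"id": 1}), ToolCall("search", {"q": "x"})]
scores = score_trace(expected, actual, recovered=False)
```

Keeping the signals separate rather than collapsing them into one score makes it possible to see that this attempt chose the right tools and parameters but failed only on ordering.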
Key failure modes observed
Common failures fall into several categories. First, hallucinated tool use: agents sometimes fabricate calls or invoke nonexistent endpoints, producing plausible but invalid sequences. Second, parameter errors: even when the right tool is chosen, agents frequently supply incorrect parameter values or formats, which causes downstream failures. Third, brittle planning: multi-step tasks expose fragile plan decomposition, where an early misstep or a minor misordering leads to task collapse.
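The first two categories can be told apart mechanically if the harness knows which tools exist and what arguments they take. The sketch below assumes a hypothetical registry, `TOOL_SCHEMAS`, with invented tool names and a deliberately simple type check; a real harness would more likely validate against a full schema.

```python
# Hypothetical registry: tool name -> required parameter names and types.
TOOL_SCHEMAS = {
    "lookup_record": {"record_id": int},
    "run_code": {"source": str, "timeout_s": int},
}


def classify_call(tool: str, params: dict) -> str:
    """Label a proposed call as 'hallucinated_tool', 'parameter_error', or 'ok'."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        # The agent named a tool that is not in the registry.
        return "hallucinated_tool"

    # Right tool, but missing or unexpected parameter names.
    if schema.keys() != params.keys():
        return "parameter_error"

    # Right names, but a value of the wrong type (e.g. "42" instead of 42).
    for name, expected_type in schema.items():
        if not isinstance(params[name], expected_type):
            return "parameter_error"

    return "ok"
```

For example, `classify_call("lookup_record", {"record_id": "42"})` is labeled a parameter error because the ID arrives as a string, the kind of format-level mistake the benchmark reports as frequent.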
Another observed class of errors is confirmation blindness, where agents do not validate tool outputs against expectations or the original query, propagating incorrect results without recovery. Non-deterministic behavior emerged as a practical problem for repeatability: the same prompt plus context occasionally produced different action plans, complicating integration testing. Finally, tool-API mismatches surfaced when models assumed richer tool semantics than provided, for example expecting transactional guarantees or idempotent behavior that the tool did not offer.
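One simple way to quantify the repeatability problem is to rerun the same prompt several times and measure how often the agent produces its most common action plan. This is an illustrative metric, not one the benchmark is stated to use; `plan_agreement` and the sample plans are hypothetical.

```python
from collections import Counter


def plan_agreement(plans: list[list[str]]) -> float:
    """Fraction of runs that produced the single most common action plan."""
    if not plans:
        raise ValueError("need at least one plan")
    # Plans are lists of tool names; tuples make them hashable for counting.
    counts = Counter(tuple(plan) for plan in plans)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(plans)


# Four hypothetical runs of the same prompt and context.
runs = [
    ["search", "fetch"],
    ["search", "fetch"],
    ["fetch", "search"],
    ["search", "fetch"],
]
agreement = plan_agreement(runs)  # 3 of 4 runs agree
```

A value of 1.0 means every run chose the same plan; lower values flag prompts where integration tests are likely to be flaky.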
VAKRA also highlights mitigations that improve outcomes. Structured prompting and explicit action vocabularies reduce hallucinated calls. Parameter validation checks and lightweight execution sandboxes catch common format errors before they cascade. Agents that include explicit verification steps, such as confirming tool outputs against a derived expectation, recover from failures more often than those that do not.
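The verification mitigation can be sketched as a thin wrapper that checks each tool result against an expectation before accepting it. Everything here is an assumption for illustration: `call_with_verification`, its retry policy, and the `flaky_sum` tool are hypothetical, and retrying is only safe when the underlying tool is idempotent.

```python
from typing import Any, Callable, Optional


def call_with_verification(
    tool: Callable[..., Any],
    params: dict,
    verify: Callable[[Any], bool],
    max_attempts: int = 2,
    fallback: Optional[Any] = None,
) -> Any:
    """Call a tool, accept the result only if verify() passes, else retry or fall back."""
    for _ in range(max_attempts):
        try:
            result = tool(**params)
        except Exception:
            # Treat a raised error like a failed check and try again.
            continue
        if verify(result):
            return result
    # Every attempt failed or was rejected: return the fallback, not a bad result.
    return fallback


# Hypothetical tool that returns a wrong answer on its first call only.
call_count = {"n": 0}


def flaky_sum(a: int, b: int) -> int:
    call_count["n"] += 1
    return -1 if call_count["n"] == 1 else a + b


verified = call_with_verification(flaky_sum, {"a": 2, "b": 3}, verify=lambda r: r == 5)
```

The wrapper rejects the first, wrong result and accepts the second, so the incorrect value never propagates to later steps.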
Why it matters
VAKRA clarifies that high-level reasoning abilities do not automatically translate to robust tool use in production; integration gaps and format-level mistakes drive many real failures. The benchmark gives engineers a focused way to measure those gaps, prioritize mitigations like parameter validation and verification steps, and compare agent architectures on concrete, actionable metrics. As agents move into application workflows, these failure modes will determine reliability and operational cost more than raw language-model fluency.
| Item | | | |
|---|---|---|---|
| Planning success | Medium | High | Low |
| Tool-call correctness | Medium | Medium | High |
| API parameter accuracy | Low | Medium | High |
| Error recovery | Low | Medium | Low |
| Repeatability | Medium | Low | High |
Written by The Brieftide · Source: Hugging Face