ToolMenuBench: Benchmarking LLM Tool-Menu Filtering Strategies
ToolMenuBench compares six filtering methods across seven model backends.
TL;DR
- ToolMenuBench compares six filtering methods across seven model backends.
- ToolMenuBench introduces a targeted evaluation for which tools an LLM agent should see and when, testing how visible tool menus affect reliability, efficiency, and safety-relevant risk exposure.
- The benchmark runs controlled experiments across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings.
ToolMenuBench introduces a targeted evaluation for which tools an LLM agent should see and when, testing how visible tool menus affect reliability, efficiency, and safety-relevant risk exposure. The benchmark runs controlled experiments across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings.
What is ToolMenuBench?
ToolMenuBench is a benchmark for evaluating tool-menu construction in multi-step LLM agents, measuring both filter-level and downstream agent outcomes. It varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and reports metrics including visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage.
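As a rough illustration of how the metrics listed above could be aggregated from logged agent runs, here is a minimal Python sketch. The `Episode` record, its field names, and the per-episode averaging are all assumptions made for this example and are not taken from the paper. Premature actions are left out because counting them would need task-specific ordering rules that the summary does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One logged agent run. Every field name is a hypothetical stand-in."""
    visible_tools: set[str]    # tools shown in the menu
    risky_tools: set[str]      # tools labeled risky for this task
    required_tools: set[str]   # tools the task actually needs
    calls: list[str] = field(default_factory=list)  # tool calls, in order
    succeeded: bool = False
    tokens: int = 0


def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Average each metric over episodes (an assumed aggregation choice)."""
    n = len(episodes)
    return {
        "visible_tool_count": sum(len(e.visible_tools) for e in episodes) / n,
        # Risky tools count as "exposed" when they appear in the visible menu.
        "risky_tool_exposure": sum(
            len(e.visible_tools & e.risky_tools) for e in episodes
        ) / n,
        "task_success": sum(e.succeeded for e in episodes) / n,
        # A call is "wrong" here if it targets a tool the task does not need.
        "wrong_tool_calls": sum(
            sum(c not in e.required_tools for c in e.calls) for e in episodes
        ) / n,
        "avg_tokens": sum(e.tokens for e in episodes) / n,
    }


episodes = [
    Episode({"a", "b", "c"}, {"c"}, {"a"}, ["a", "b"], True, 100),
    Episode({"a"}, {"c"}, {"a"}, ["a"], False, 50),
]
summary = summarize(episodes)
print(summary)
```

Keeping filter-level fields (what was visible) and downstream fields (what the agent did) in the same record mirrors the benchmark's stated aim of reporting both kinds of outcome side by side.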
ToolMenuBench frames the agent-interface problem as a concrete evaluation: which tools should be visible, when they should be visible, and under what cost or risk constraints. The authors position the benchmark as reusable for studying tool-menu strategies rather than solely checking whether a model can call a tool correctly.
How was the benchmark run and what were the core results?
The controlled evaluation used seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings, and then compared filter-level outputs and downstream agent behavior. Key reported metrics are visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage.
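To give a sense of scale, the sketch below enumerates the experimental grid under the assumption that the four factors are fully crossed. The summary does not say whether every combination was actually run, and the labels are placeholders rather than the paper's names; only the counts come from the text.

```python
from itertools import product

# Counts come from the text; the labels are placeholders.
backends = [f"backend_{i}" for i in range(1, 8)]       # seven model backends
menu_sizes = [f"menu_size_{i}" for i in range(1, 4)]   # three tool-menu sizes
filters = [f"filter_{i}" for i in range(1, 7)]         # six filtering methods
settings = [f"setting_{i}" for i in range(1, 8)]       # seven evaluation settings

# Assumption: a fully crossed design, one cell per combination.
grid = list(product(backends, menu_sizes, filters, settings))
print(len(grid))  # 7 * 3 * 6 * 7 = 882
```

If the design was only partially crossed, the real number of configurations would be smaller than 882.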
The headline result: causal minimal tool filtering (CMTF) raises task success from 32.1% under all-tools exposure to 85.7%. The paper also reports that CMTF cuts average token usage by roughly 98% relative to unfiltered exposure. Beyond those two figures, the authors state that CMTF reduces visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines.
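The two reported figures can be restated in a few lines of arithmetic. The inputs below are the numbers quoted above; the derived quantities are simple consequences of them and are not additional results from the paper.

```python
all_tools_success = 32.1   # percent, from the text
cmtf_success = 85.7        # percent, from the text
token_reduction = 0.98     # "roughly 98%", from the text

point_gain = cmtf_success - all_tools_success      # percentage points
relative_gain = cmtf_success / all_tools_success   # multiplicative factor
remaining_tokens = 1 - token_reduction             # fraction of baseline tokens left

print(f"{point_gain:.1f} points, {relative_gain:.2f}x, "
      f"{remaining_tokens:.0%} of baseline tokens")
# prints "53.6 points, 2.67x, 2% of baseline tokens"
```

In other words, a roughly 98% reduction means the filtered agent uses about one fiftieth of the tokens consumed under unfiltered exposure.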
Which filtering methods were compared and how do they differ?
The benchmark evaluates six filtering methods, contrasting unfiltered exposure, lexical filtering, and state-aware filtering with causal-path and causal minimal filtering strategies. The evaluation isolates how menu-construction choices change both what the agent sees and how it behaves across multi-step tasks.
ToolMenuBench explicitly manipulates distractor types and state-dependent task structure to see when different filters expose risky tools or prompt wrong-tool calls and premature actions. The paper measures both the visible-tool count presented to the agent and downstream effects such as task success and token consumption, allowing direct tradeoff comparisons between broader visibility and narrower, risk-aware menus.
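The summary does not spell out how each filter is implemented, so the following is only a toy interpretation of the named families, not the paper's algorithms. The `Tool` type, the `requires`/`produces` facts, the example tools, and the `minimal_next_step` function are all hypothetical. The last function is a simple backward chain from the goal, offered as one plausible reading of "minimal" filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record used only for this illustration."""
    name: str
    description: str
    requires: frozenset = frozenset()   # state facts needed before use
    produces: frozenset = frozenset()   # state facts the tool makes true


def unfiltered(tools, task, state, goal):
    return list(tools)


def lexical(tools, task, state, goal):
    # Keep any tool whose description shares a word with the task text.
    words = set(task.lower().split())
    return [t for t in tools if words & set(t.description.lower().split())]


def state_aware(tools, task, state, goal):
    # Keep tools whose preconditions already hold in the current state.
    return [t for t in tools if t.requires <= state]


def minimal_next_step(tools, task, state, goal):
    """Keep only usable tools that lie on a backward chain from the goal."""
    needed, frontier = set(), set(goal) - state
    while frontier:
        fact = frontier.pop()
        for t in tools:
            if fact in t.produces and t.name not in needed:
                needed.add(t.name)
                frontier |= t.requires - state
    return [t for t in tools if t.name in needed and t.requires <= state]


TOOLS = [
    Tool("search_orders", "search customer orders",
         produces=frozenset({"order_found"})),
    Tool("refund_order", "refund a customer order",
         requires=frozenset({"order_found"}), produces=frozenset({"refunded"})),
    Tool("delete_account", "delete a customer account",
         produces=frozenset({"account_deleted"})),  # risky distractor
]
TASK, STATE, GOAL = "refund the customer order", frozenset(), {"refunded"}

for f in (unfiltered, lexical, state_aware, minimal_next_step):
    print(f.__name__, [t.name for t in f(TOOLS, TASK, STATE, GOAL)])
```

In this toy setup the lexical filter still shows `delete_account` because its description shares the word "customer" with the task, and the state-aware filter shows it because it has no preconditions. Only the goal-directed filter narrows the menu to `search_orders`, the single tool needed at the first step.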
Why it matters
Tool menus are an interface decision that shapes agent behavior, not just an engineering convenience. ToolMenuBench provides empirical evidence that menu filtering can dramatically change outcomes: the benchmark shows a move from 32.1% to 85.7% task success when applying causal minimal filtering and reports about a 98% cut in average token usage. Those numbers imply menu design can affect both reliability and operational cost in multi-step agent workflows.
The benchmark also reframes safety exposure as a visible-menu design problem. By measuring risky-tool exposure directly, ToolMenuBench lets researchers and engineers quantify how menu choices increase or decrease access to risky capabilities, rather than leaving that assessment implicit.
What to watch
Watch whether CMTF-style filtering replicates outside the seven model backends and seven evaluation settings used in this study, and whether the roughly 98% reduction in token usage holds across larger tool libraries and real-world agent deployments. Adoption of ToolMenuBench by other teams would show whether the menu-construction gains generalize beyond the paper's controlled experiments.
ToolMenuBench supplies a reusable framework for those next tests, so the next concrete signal will be follow-on studies that report the same filter-level and downstream metrics on different models and tool collections.
| Metric | All-tools exposure | CMTF |
|---|---|---|
| Task success | 32.1% | 85.7% |
| Average token usage | baseline | reduced by roughly 98% |
| Visible-tool count | many | reduced |
| Wrong-tool calls | higher | reduced |
| Premature actions | higher | reduced |
| Risky-tool exposure | higher | reduced |
Written by The Brieftide · Source: arXiv