Benchmarks & Evals · June 16, 2026 · 4 min read

IRTS-ToolBench: A Benchmark for Irregular Time Series QA

A 1,700-question benchmark across 10 task types and 13 domains for LLM-based irregular time series analysis with a reproducible protocol.

The Brieftide · June 16, 2026

TL;DR

01 A 1,700-question benchmark across 10 task types and 13 domains for LLM-based irregular time series analysis with a reproducible protocol.
02 IRTS-ToolBench is a new benchmark introduced to probe how large language models and AI agents handle irregular time series.
03 The benchmark contains 1,700 questions spanning 10 task types across 13 domains, and the authors provide code linked in the submission.

IRTS-ToolBench is a new benchmark introduced to probe how large language models and AI agents handle irregular time series. The paper, titled "Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning," was submitted to arXiv on June 13, 2026 by Sanhorn Chen, Xiaoyang Chen, Boyu Liu, and Roy Zhao. The benchmark contains 1,700 questions spanning 10 task types across 13 domains, and the authors provide code linked in the submission.

What the paper introduces

The authors frame the problem around real-world time series data, which they describe as overwhelmingly irregular: observations are asynchronous, missing values can be informative rather than random, and sampling frequencies vary across sensors and operational windows. To address this, they introduce IRTS-ToolBench, a dataset and evaluation protocol for irregular time series question answering (TSQA). The benchmark is explicitly designed so that any researcher working on LLM-based irregular time series analysis can use it independently, and the paper includes a reproducible evaluation protocol.
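To make the three irregularity properties concrete, here is a minimal sketch in plain Python. The sensor names, timestamps, and values are invented for illustration and are not taken from IRTS-ToolBench; the helper names `sampling_intervals` and `irregularity_summary` are likewise hypothetical.

```python
from statistics import median

# Hypothetical readings as (timestamp in seconds, value) pairs.
# Illustrative only; this is not data from IRTS-ToolBench.
readings = {
    "temp": [(0, 20.1), (60, 20.3), (120, 20.2), (600, 22.8), (660, 23.0)],
    "pressure": [(15, 101.2), (400, 101.0), (1000, 100.7)],
}


def sampling_intervals(series):
    """Return the gaps, in seconds, between consecutive observations."""
    times = [t for t, _ in series]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def irregularity_summary(series):
    """Summarize how far a series departs from regular sampling."""
    gaps = sampling_intervals(series)
    return {
        "n_obs": len(series),
        "median_gap": median(gaps),
        "max_gap": max(gaps),
        # A ratio of 1.0 would mean perfectly regular sampling.
        "gap_ratio": max(gaps) / min(gaps),
    }


summaries = {name: irregularity_summary(s) for name, s in readings.items()}

# Asynchronous sensors: the two series share no timestamps at all,
# so they cannot be aligned row by row without resampling or imputation.
shared_timestamps = {t for t, _ in readings["temp"]} & {
    t for t, _ in readings["pressure"]
}

for name, summary in summaries.items():
    print(name, summary)
print("shared timestamps:", shared_timestamps)
```

In this toy example the temperature sensor has a 480-second hole in an otherwise 60-second cadence. Whether such a gap is random or informative (for instance, a sensor that stops reporting during an outage) is exactly the kind of question a regular-sampling benchmark cannot pose.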

IRTS-ToolBench comprises 1,700 questions that cover 10 distinct task types and come from 13 different domains. The submission notes that code for the benchmark is available via a URL included in the paper.

How this differs from prior TSQA benchmarks

The paper highlights a gap in existing TSQA evaluations: most current benchmarks assume regularly sampled inputs. The authors argue that this assumption leaves a fundamental gap in understanding model and agent behavior under the irregular conditions common in deployed systems. By targeting asynchronous observations, informative missingness, and variable sampling rates, IRTS-ToolBench aims to force models and agentic pipelines to confront those real-world irregularities.

The title and framing indicate that the authors also prioritize verifiability and tool-grounded reasoning. The submission runs 15 pages; its abstract concentrates on the benchmark design and on the stated aim of supporting verifiable, agentic data science workflows that rely on LLMs and external tools when reasoning about irregular time series.
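One way to picture tool-grounded reasoning is an agent that delegates arithmetic to a deterministic function and records the call, so the answer can be recomputed later. The sketch below is an assumption about what such a setup might look like, not the paper's tool set or protocol: the `TOOLS` registry, the `run_tool_call` helper, the call format, and the sample series are all hypothetical. It also shows why irregular sampling matters, since a plain average and a time-weighted average disagree when observations are unevenly spaced.

```python
def naive_mean(series):
    """Average the values while ignoring how far apart they were observed."""
    values = [v for _, v in series]
    return sum(values) / len(values)


def time_weighted_mean(series):
    """Average under a hold-last-value assumption.

    Each value is weighted by the time elapsed until the next observation,
    so the final observation only marks the end of the window.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    total = 0.0
    for (t0, v0), (t1, _) in zip(series, series[1:]):
        total += v0 * (t1 - t0)
    return total / (series[-1][0] - series[0][0])


# Hypothetical tool registry; names are illustrative, not from the paper.
TOOLS = {"naive_mean": naive_mean, "time_weighted_mean": time_weighted_mean}


def run_tool_call(call, series):
    """Run one tool call and keep the call next to its result for re-checking."""
    result = TOOLS[call["tool"]](series)
    return {"call": call, "result": result}


# Invented (timestamp, value) series: dense early samples, then a long gap.
series = [(0, 10.0), (10, 10.0), (20, 10.0), (30, 40.0), (100, 40.0)]

naive = run_tool_call({"tool": "naive_mean"}, series)
weighted = run_tool_call({"tool": "time_weighted_mean"}, series)
print(naive["result"], weighted["result"])
```

Because each result is stored with the call that produced it, a reviewer can rerun the same function on the same series and confirm the number, which is the sense of "verifiable" this sketch is meant to convey.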

Why it matters

Benchmarks shape what researchers optimize for. Existing TSQA work that assumes regular sampling risks producing methods that break in operational settings. IRTS-ToolBench directly targets that mismatch by providing standardized inputs and a reproducible protocol tailored to irregular data. That makes it easier to compare approaches that incorporate tool use, data preprocessing strategies, or agentic decision making when inputs are irregular, asynchronous, or sparsely observed.

Providing a publicly accessible code base alongside the benchmark lowers the friction for independent evaluation and replication. If researchers adopt the dataset and protocol, comparisons between LLM-based agents and other methods will be more meaningful for real deployments where time series rarely behave like textbook examples.

What to watch

Watch for the code and dataset linked in the paper to appear in public repositories and for early independent evaluations using IRTS-ToolBench. Also track whether subsequent TSQA work shifts away from regular-sampling assumptions and reports results across the 10 task types and 13 domains defined by this benchmark.

Written by The Brieftide · Source: arXiv