DiagFlowBench: 1,676 dialogues test LLMs on off-procedure inputs
Dataset converts 50 industrial flowcharts into 1,676 conversations and measures ten models' abstention and mis-mapping on off-procedure inputs.
TL;DR
- Dataset converts 50 industrial flowcharts into 1,676 conversations and measures ten models' abstention and mis-mapping on off-procedure inputs.
- The paper, authored by Guillermo Gil de Avalle, Laura Maruster, Shaina Raza and Christos Emmanouilidis, frames the benchmark around grounded diagnostic dialogue in maintenance operations.
- The submission appears on arXiv as arXiv:2606.17904 and was posted on 16 Jun 2026.
DiagFlowBench, submitted to arXiv on 16 Jun 2026, converts 50 industrial diagnostic flowcharts from a consumer manufacturer into 1,676 multi-turn conversations that contrast compliant turns with out-of-scope utterances. The authors evaluated a panel of ten commercial and open-weight models and found high variability in abstention rates. Rather than fabricating facts, models often mapped out-of-scope queries to steps that exist in the procedure but do not fit the conversation context.
What is DiagFlowBench?
DiagFlowBench is a benchmark dataset and evaluation designed to test how language models behave when operator queries stray from approved procedural paths: it comprises 50 industrial flowcharts converted into 1,676 multi-turn diagnostic conversations. The paper, authored by Guillermo Gil de Avalle, Laura Maruster, Shaina Raza and Christos Emmanouilidis, frames the benchmark around grounded diagnostic dialogue in maintenance operations.
The authors built the corpus from flowcharts supplied by a consumer manufacturer and explicitly paired compliant conversation turns with out-of-scope utterances so models must either abstain or map the input to a procedural step. The submission appears on arXiv as arXiv:2606.17904 and was posted on 16 Jun 2026.
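The pairing of compliant turns with out-of-scope utterances can be pictured with a small data-structure sketch. This is an illustration only, not the authors' format: the `Turn` and `Conversation` classes, their field names, and the `with_out_of_scope` helper are all hypothetical, as are the step ids and utterances in the usage lines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    utterance: str
    in_scope: bool                 # True if the utterance follows the flowchart
    expected_step: Optional[str]   # hypothetical step id; None means "should abstain"


@dataclass
class Conversation:
    flowchart_id: str
    turns: List[Turn] = field(default_factory=list)


def with_out_of_scope(compliant: Conversation, position: int, utterance: str) -> Conversation:
    """Return a copy of a compliant conversation with one off-procedure turn inserted."""
    turns = list(compliant.turns)  # copy so the compliant original stays unchanged
    turns.insert(position, Turn(utterance, in_scope=False, expected_step=None))
    return Conversation(compliant.flowchart_id, turns)


# Hypothetical example: a two-turn compliant dialogue and its off-procedure variant.
compliant = Conversation(
    "flowchart-01",
    [
        Turn("The unit does not power on.", True, "S1"),
        Turn("The indicator light is off.", True, "S2"),
    ],
)
variant = with_out_of_scope(compliant, 1, "Can I bypass the safety switch?")
```

Under this sketch, a model's reply to the inserted turn would be judged against `expected_step=None`, meaning the desired behaviour is to flag the input as out of scope.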
How were models evaluated and what did they do?
The paper evaluated a panel of ten commercial and open-weight models across the 1,676 multi-turn conversations and measured how often models abstained versus offered steps from the procedure. The results show large variability in abstention rates and one recurring behaviour: in many cases models chose a real but contextually inadequate step rather than inventing facts, producing advice that is plausible and authoritative yet incorrect for the conversation context.
The benchmark emphasises a practical failure mode for grounded systems: when an operator asks something outside the documented procedure mid-conversation, current models frequently map that input to an existing step instead of signalling out-of-scope. The authors characterise this as a vulnerability because the mapped-but-wrong advice carries inherent plausibility and authority, increasing the risk that an operator will follow incorrect guidance.
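The distinction between abstaining, mapping to a real but wrong step, and fabricating can be made concrete with a short scoring sketch. It is a minimal illustration under stated assumptions, not the paper's scoring procedure: the labels, the `label_reply` and `out_of_scope_rates` functions, and the convention that a reply is reduced to the step id it cites (or `None` for an abstention) are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set

LABELS = ("abstained", "mis_mapped", "fabricated")


def label_reply(cited_step: Optional[str], flowchart_steps: Set[str]) -> str:
    """Label one reply to an out-of-scope utterance."""
    if cited_step is None:
        return "abstained"       # model flagged the input as out of scope
    if cited_step in flowchart_steps:
        return "mis_mapped"      # real step from the procedure, wrong for this input
    return "fabricated"          # step that does not exist in the flowchart


def out_of_scope_rates(
    cited_steps: Iterable[Optional[str]], flowchart_steps: Set[str]
) -> Dict[str, float]:
    """Share of out-of-scope replies falling under each label."""
    counts = Counter(label_reply(step, flowchart_steps) for step in cited_steps)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no out-of-scope replies to score")
    return {label: counts[label] / total for label in LABELS}


# Hypothetical replies to four out-of-scope utterances on a three-step flowchart.
rates = out_of_scope_rates([None, "S3", "S3", "S99"], {"S1", "S2", "S3"})
```

For these made-up inputs, `rates` is `{"abstained": 0.25, "mis_mapped": 0.5, "fabricated": 0.25}`. Keeping `mis_mapped` separate from `fabricated` matters here because the article's point is that grounded models can score well on fabrication while still giving inappropriate guidance.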
Why does DiagFlowBench matter?
DiagFlowBench matters because language models are increasingly used as advisory systems in maintenance operations, and constraining models to procedural documentation is a common mitigation against hallucination. The dataset forces attention on recognition of out-of-scope inputs mid-conversation, a dynamic the authors say current benchmarks rarely prioritise, exposing where grounded systems can fail in operational settings.
If models reliably map off-procedure queries to plausible but wrong steps, designers who rely solely on grounding to suppress hallucination could be left with a different, subtle safety risk: actionable-sounding but inappropriate guidance. That risk affects manufacturers, operators, and teams deploying LLMs into step-driven workflows.
What to watch
Look for follow-up work that measures whether abstention behaviour improves when benchmarks explicitly reward refusing or flagging out-of-scope inputs, and for wider adoption of off-procedure test cases across other industrial datasets. The next concrete signals will be new benchmarks or model updates that report changes in abstention rates on DiagFlowBench-like scenarios.
Authors and submission details: the paper is by Guillermo Gil de Avalle, Laura Maruster, Shaina Raza and Christos Emmanouilidis, submitted 16 Jun 2026 to arXiv as arXiv:2606.17904.
Written by The Brieftide · Source: arXiv