Foundation Models · 5 min read

Pre-Flight benchmark: LLM aviation test shows 82.7% top score

Pre-Flight is a 300-question aviation benchmark; the best 2026 model scored 82.7% versus an informal expert reference of about 95%.

The Brieftide

TL;DR

  • 01 Pre-Flight is a 300-question aviation benchmark; the best 2026 model scored 82.7% versus an informal expert reference of about 95%.
  • 02 Pre-Flight, an open-source benchmark authored by Alex Brooker and Tim Hughes, evaluates large language models on 300 multiple-choice questions about aviation operational knowledge.
  • 03 Submitted to arXiv on 2 Jul 2026, the paper reports that the strongest model evaluated (released in 2026) reaches 82.7% accuracy on the set, well below an informal expert reference of around 95%.

Pre-Flight, an open-source benchmark authored by Alex Brooker and Tim Hughes, evaluates large language models on 300 multiple-choice questions about aviation operational knowledge. Submitted to arXiv on 2 Jul 2026, the paper reports that the strongest model evaluated (released in 2026) reaches 82.7% accuracy on the set, well below an informal expert reference of around 95%.

What is Pre-Flight and what does it test?

Pre-Flight is a 300-question multiple-choice benchmark drawn from international standards and airport ground operations material. It covers four areas: international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. The questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. The released package includes the dataset, the evaluation harness and the results.

How were models evaluated and how did they perform?

The authors evaluated a range of contemporary commercial and open-weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple-choice protocol. They maintain a rolling leaderboard as new models are released. The top-performing model they tested, released in 2026, scored 82.7% on the 300-item set. The paper notes that model performance improved only gradually, from roughly 75% in early 2025 to the 82.7% result, leaving a persistent gap below the informal expert reference of about 95%. The authors obtained that reference from a small-sample quiz of aviation professionals at a conference, so it is an indicative figure rather than a measured human baseline.
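As a rough illustration of what "scoring by accuracy" means for a multiple-choice set, the sketch below counts exact matches between a model's chosen option letters and an answer key. It is not the authors' harness or the Inspect API: the function name `accuracy` and the four-question toy data are hypothetical. The last lines add one inference that the paper does not state directly: if all 300 questions are scored as simply right or wrong, 248 correct is the only whole-number count that rounds to 82.7%.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Fraction of questions where the chosen option letter matches the key."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        raise ValueError("at least one question is required")
    hits = sum(
        p.strip().upper() == c.strip().upper()
        for p, c in zip(predicted, correct)
    )
    return hits / len(correct)


# Hypothetical toy data (not from the benchmark): four questions,
# their answer key, and one model's chosen options.
answer_key = ["A", "C", "B", "D"]
model_choices = ["A", "C", "D", "D"]
toy_score = accuracy(model_choices, answer_key)  # 3 of 4 match the key

# Inference, not a figure from the paper: find every whole-number count
# of correct answers out of 300 that rounds to the reported 82.7%.
counts_matching = [k for k in range(301) if round(100 * k / 300, 1) == 82.7]

print(f"toy accuracy: {toy_score:.2%}")
print(f"correct answers consistent with 82.7% of 300: {counts_matching}")
```

On a 300-question set each question is worth a third of a percentage point, so small leaderboard differences can come down to a handful of answers.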

The benchmark and evaluation harness are available within the community evaluations package distributed with inspect_evals (UKGovernmentBEIS/inspect_evals). The paper itself is nine pages long and includes one figure and two tables documenting the benchmark and results.

Why does this gap matter?

The persistent gap between model accuracy and the informal expert reference indicates that contemporary LLMs do not yet match practitioner-level reliability on aviation operational knowledge. The authors argue that general-purpose benchmarks do not measure whether a model reasons safely and correctly about domain-specific, regulated operational tasks. They present Pre-Flight as a necessary domain-specific evaluation step before deploying generative AI in non-safety-critical aviation operations.

Domain-specific failure modes in aviation can be consequential because the material maps to internationally regulated procedures and ground operations. By releasing the dataset and harness, the authors aim to enable repeatable assessments and a rolling comparison as models change, rather than relying on broad benchmarks that do not probe aviation rules and scenarios.

What to watch

Watch the Pre-Flight rolling leaderboard in inspect_evals for new model entries and accuracy changes; the authors explicitly maintain the leaderboard as models are released. A concrete milestone to track is whether future top models close the gap to the informal expert reference of around 95% accuracy on the 300-question set.
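To make that milestone concrete, the arithmetic below converts the reported percentages into question counts on the 300-item set. It is a back-of-the-envelope sketch: the percentages come from the article, while the variable names, the helper `questions_at` and the rounding to whole questions are assumptions made for illustration. Because the expert reference is informal ("around 95%"), the resulting counts are approximate.

```python
TOTAL_QUESTIONS = 300

# Accuracies in tenths of a percent, kept as integers to avoid float drift.
TOP_MODEL_PERMILLE = 827   # 82.7%, best model reported
EXPERT_PERMILLE = 950      # "around 95%", informal expert reference
EARLY_2025_PERMILLE = 750  # "roughly 75%", early 2025 models


def questions_at(permille: int, total: int = TOTAL_QUESTIONS) -> int:
    """Nearest whole number of correct answers for a given accuracy."""
    return round(permille * total / 1000)


top_correct = questions_at(TOP_MODEL_PERMILLE)
expert_correct = questions_at(EXPERT_PERMILLE)
early_correct = questions_at(EARLY_2025_PERMILLE)

# Remaining gap to the expert reference, in points and in questions.
gap_points = (EXPERT_PERMILLE - TOP_MODEL_PERMILLE) / 10
gap_questions = expert_correct - top_correct

# Progress since early 2025, in points and in questions.
gain_points = (TOP_MODEL_PERMILLE - EARLY_2025_PERMILLE) / 10
gain_questions = top_correct - early_correct

print(f"gap to expert reference: {gap_points} points, about {gap_questions} questions")
print(f"gain since early 2025: {gain_points} points, about {gain_questions} questions")
```

Read this way, the reported gain since early 2025 is smaller than the gap that remains to the informal expert reference.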

Pre-Flight benchmark key figures

Item                          Value
Number of questions           300
Top model accuracy (2026)     82.7%
Informal expert reference     around 95%
Early 2025 model baseline     roughly 75%
Paper length and figures      9 pages, 1 figure, 2 tables
Package location              inspect_evals (UKGovernmentBEIS/inspect_evals)

Written by The Brieftide · Source: arXiv
