Pre-Flight benchmark: LLM aviation test shows 82.7% top score
Pre-Flight is a 300-question aviation benchmark; the best 2026 model scored 82.7% versus an informal expert reference of about 95%.
TL;DR
- Pre-Flight is a 300-question aviation benchmark; the best 2026 model scored 82.7% versus an informal expert reference of about 95%.
- Pre-Flight, an open source benchmark authored by Alex Brooker and Tim Hughes, evaluates large language models on 300 multiple choice questions about aviation operational knowledge.
- Submitted to arXiv on 2 Jul 2026, the benchmark finds the strongest model evaluated (released in 2026) reaches 82.7% accuracy on the set, well below an informal expert reference of around 95%.
Pre-Flight, an open source benchmark authored by Alex Brooker and Tim Hughes, evaluates large language models on 300 multiple choice questions about aviation operational knowledge. The paper, submitted to arXiv on 2 Jul 2026, reports that the strongest model evaluated (released in 2026) reaches 82.7% accuracy on the set, well below an informal expert reference of around 95%.
What is Pre-Flight and what does it test?
Pre-Flight is a 300-question multiple choice benchmark drawn from international standards and airport ground operations material. It covers international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. The questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. The released package includes the dataset, the evaluation harness and the results.
How were models evaluated and how did they perform?
The authors evaluated a range of contemporary commercial and open weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple choice protocol. They maintain a rolling leaderboard as new models are released. The top-performing model they tested, released in 2026, scored 82.7% on the 300-item set. The paper notes that model performance improved only gradually, from roughly 75% in early 2025 to the 82.7% result, leaving a persistent gap below the informal expert reference of about 95%. The authors obtained that reference from a small-sample quiz of aviation professionals at a conference.
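As a rough illustration of what accuracy scoring under a multiple choice protocol means, the sketch below compares each predicted answer letter with the keyed letter and reports the fraction correct. It is not the Inspect harness or the authors' code: the `Item` type, the `accuracy` function and the sample items are hypothetical, and a real evaluation also has to handle prompting and extracting the answer letter from the model's reply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple choice question (hypothetical type, for illustration)."""
    target: str     # keyed correct option letter, e.g. "B"
    predicted: str  # option letter extracted from the model's reply


def accuracy(items: list[Item]) -> float:
    """Return the fraction of items whose predicted letter matches the key."""
    if not items:
        raise ValueError("cannot score an empty item list")
    correct = sum(
        1
        for item in items
        # Compare case-insensitively and ignore stray whitespace.
        if item.predicted.strip().upper() == item.target.strip().upper()
    )
    return correct / len(items)


# Made-up items for illustration only; not taken from the Pre-Flight dataset.
sample = [Item("A", "A"), Item("C", "b"), Item("D", "d"), Item("B", "B")]
print(f"accuracy = {accuracy(sample):.1%}")
```

Under this kind of scoring every question carries equal weight, so a single percentage such as 82.7% summarises the whole 300-item set.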
The benchmark and evaluation harness are available within the community evaluations package distributed with inspect_evals (UKGovernmentBEIS/inspect_evals). The paper itself is nine pages long and includes one figure and two tables documenting the benchmark and results.
Why does this gap matter?
The persistent gap between model accuracy and the informal expert reference shows that contemporary LLMs do not yet match practitioner-level reliability on aviation operational knowledge. The authors argue that general-purpose benchmarks do not measure whether a model reasons safely and correctly about domain-specific, regulated operational tasks, and they present Pre-Flight as a necessary domain-specific evaluation step before deploying generative AI in non-safety-critical aviation operations.
Domain specific failure modes in aviation can be consequential because the material maps to internationally regulated procedures and ground operations. By releasing the dataset and harness, the authors aim to enable repeatable assessments and a rolling comparison as models change, rather than relying on broad benchmarks that do not probe aviation rules and scenarios.
What to watch
Watch the Pre-Flight rolling leaderboard in inspect_evals for new model entries and accuracy changes; the authors explicitly maintain the leaderboard as models are released. A concrete milestone to track is whether future top models close the gap to the informal expert reference of around 95% accuracy on the 300-question set.
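For readers who want to track that milestone, the arithmetic can be sketched as below. The constants come from the figures quoted in this brief; the function names are hypothetical, and converting percentages to question counts assumes the reported accuracy is a simple fraction of the 300 items.

```python
N_QUESTIONS = 300        # size of the Pre-Flight set (from the brief)
TOP_ACCURACY = 0.827     # best 2026 model (from the brief)
EXPERT_REFERENCE = 0.95  # informal expert reference, "around 95%"
EARLY_2025 = 0.75        # early 2025 baseline, "roughly 75%"


def gap_points(model_acc: float, reference: float = EXPERT_REFERENCE) -> float:
    """Gap to the reference, in percentage points."""
    return (reference - model_acc) * 100


def gap_questions(
    model_acc: float,
    reference: float = EXPERT_REFERENCE,
    n: int = N_QUESTIONS,
) -> int:
    """Approximate number of extra correct answers needed to reach the reference."""
    return round(reference * n) - round(model_acc * n)


print(f"gap to expert reference: {gap_points(TOP_ACCURACY):.1f} points")
print(f"gap in questions: {gap_questions(TOP_ACCURACY)} of {N_QUESTIONS}")
print(f"gain since early 2025: {gap_points(EARLY_2025, TOP_ACCURACY):.1f} points")
```

With the figures above, the gap is 12.3 percentage points, or roughly 37 of the 300 questions, against a gain of about 7.7 points since early 2025. Because the expert reference is informal and the early 2025 baseline is approximate, these numbers are indicative rather than exact.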
| Item | Value |
|---|---|
| Number of questions | 300 |
| Top model accuracy (2026) | 82.7% |
| Informal expert reference | around 95% |
| Early 2025 model baseline | roughly 75% |
| Paper length and figures | 9 pages, 1 figure, 2 tables |
| Package location | inspect_evals (UKGovernmentBEIS/inspect_evals) |
Written by The Brieftide · Source: arXiv