Gemini 3 Flash tops GPT-5 mini and DeepSeek Chat 3.2 on Scrum Qs
Three LLMs answered 993 PSM I–style questions; Gemini 3 Flash had highest accuracy, while multi-select and True/False were most error-prone.
TL;DR
- Three LLMs answered 993 PSM I–style questions; Gemini 3 Flash had highest accuracy, while multi-select and True/False were most error-prone.
- The paper tests accuracy, intra-model stability and error patterns using zero-shot, chain-of-thought and source-grounded prompting, with repeated executions to gauge variability.
- The dataset and setup target certification-style items where strict adherence to normative Scrum definitions, roles, artifacts and rules matters.
Three large language models, GPT-5 mini, Gemini 3 Flash and DeepSeek Chat 3.2, were evaluated on 993 Scrum certification-style questions in a study submitted to arXiv on 29 Jun 2026 (arXiv:2607.00048). The paper tests accuracy, intra-model stability and error patterns using zero-shot, chain-of-thought and source-grounded prompting, with repeated executions to gauge variability.
What did the paper test and how?
The authors evaluated the three contemporary LLMs on 993 questions aligned with the Professional Scrum Master I (PSM I) assessment format, applying three prompting strategies (zero-shot, chain-of-thought and source-grounded) and repeating runs to measure intra-model stability. The dataset and setup target certification-style items where strict adherence to normative Scrum definitions, roles, artifacts and rules matters.
The study's experimental design measured model accuracy across question formats and Scrum topics, and complemented quantitative results with a qualitative analysis of recurring error patterns in incorrect answers. The paper lists eight authors: Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio and Angelo Perkusich.
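As a rough illustration of what measuring accuracy and intra-model stability over repeated runs can look like, the sketch below scores each run by exact match on the selected option set, then reports mean accuracy, the spread across runs and the share of questions answered identically in every run. The function names, the toy answers and the choice of metrics are assumptions made for illustration; the brief does not say which stability measure the authors used.

```python
from statistics import mean, pstdev

def run_accuracy(predictions, gold):
    """Fraction of questions whose predicted option set equals the gold set."""
    correct = sum(1 for qid, answer in gold.items() if predictions.get(qid) == answer)
    return correct / len(gold)

def stability_report(runs, gold):
    """Summarise repeated runs of one model (hypothetical metrics)."""
    accuracies = [run_accuracy(run, gold) for run in runs]
    # A question counts as stable if every run produced the same answer set.
    stable = sum(1 for qid in gold if len({run.get(qid) for run in runs}) == 1)
    return {
        "mean_accuracy": mean(accuracies),
        "accuracy_std": pstdev(accuracies),
        "answer_agreement": stable / len(gold),
    }

# Invented toy data, not taken from the paper.
gold = {
    "q1": frozenset({"A"}),
    "q2": frozenset({"B", "D"}),      # multi-select: exact match required
    "q3": frozenset({"True"}),
    "q4": frozenset({"C"}),
}
run_a = {"q1": frozenset({"A"}), "q2": frozenset({"B", "D"}),
         "q3": frozenset({"False"}), "q4": frozenset({"C"})}
run_b = {"q1": frozenset({"A"}), "q2": frozenset({"B"}),
         "q3": frozenset({"False"}), "q4": frozenset({"C"})}
runs = [run_a, run_b, dict(run_a)]

report = stability_report(runs, gold)
print(report)
```

In this toy case the model is wrong on q3 in every run, which is the kind of consistent, reproducible error the study describes, while q2 flips between runs and lowers the agreement score.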
How did the models perform across questions and topics?
Gemini 3 Flash achieved the highest accuracy, GPT-5 mini ranked second, and DeepSeek Chat 3.2 ranked third, while intra-model variability remained low across all prompting conditions. By question format, the models were most accurate on single-answer multiple-choice items and more error-prone on multi-select and True/False questions.
Performance varied by Scrum topic. The models were more consistent in normatively explicit areas such as Artifacts, Empiricism and Product Value, but showed fragility in Scrum Values, Self-Managing Teams, and Stakeholders & Customers. The paper emphasizes that errors were systematic rather than random and identifies recurring patterns including overgeneralization, restrictive wording, compound distractors, and conflicts between common market interpretations and strict Scrum definitions.
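A breakdown by question format or Scrum topic of the kind described above can be sketched as a simple grouped accuracy. The record layout, the `accuracy_by_group` helper and every value in the toy data are hypothetical and are not figures from the paper, which this brief reports only in relative terms.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Exact-match accuracy per value of `key` (e.g. 'format' or 'topic')."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        # Multi-select items only count when the full option set matches.
        hits[group] += int(record["predicted"] == record["gold"])
    return {group: hits[group] / totals[group] for group in totals}

# Invented toy records, not results from the study.
records = [
    {"format": "single", "topic": "Artifacts",
     "predicted": {"A"}, "gold": {"A"}},
    {"format": "single", "topic": "Scrum Values",
     "predicted": {"C"}, "gold": {"C"}},
    {"format": "multi", "topic": "Scrum Values",
     "predicted": {"A", "B"}, "gold": {"A", "B", "D"}},
    {"format": "multi", "topic": "Artifacts",
     "predicted": {"B", "C"}, "gold": {"B", "C"}},
    {"format": "true_false", "topic": "Stakeholders & Customers",
     "predicted": {"True"}, "gold": {"False"}},
]

by_format = accuracy_by_group(records, "format")
by_topic = accuracy_by_group(records, "topic")
print(by_format)
print(by_topic)
```

Exact-match scoring is one plausible reason multi-select items are harder: a single missing or extra option makes the whole answer wrong.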
Why it matters
Certification-style questions probe precise, normative knowledge rather than loose factual recall, so systematic LLM errors on those items expose where model outputs diverge from formal definitions. That divergence matters for organizations using LLMs for exam preparation, automated assessment or decision support in governed practices, because consistent error patterns are easier to diagnose but harder to mask than random mistakes.
The low intra-model variability reported by the authors indicates reproducible behaviours across repeated executions, which makes both the strengths and the weaknesses of each model predictable in applied settings.
What to watch
Look for follow-up studies that publish numeric accuracy results per model and per question format, or that extend the 993-item PSM I–style set to other certification bodies and normative domains. A concrete next signal will be papers or benchmarks that quantify the same models' performance on expanded Scrum-topic splits or report remediation strategies for the systematic error types this study identifies.
| Model | Accuracy rank | Notes |
|---|---|---|
| Gemini 3 Flash | 1 | Highest accuracy; low intra-model variability |
| GPT-5 mini | 2 | Second-highest accuracy; tested with zero-shot, chain-of-thought and source-grounded prompts |
| DeepSeek Chat 3.2 | 3 | Lowest accuracy among the three; systematic errors like overgeneralization noted |
Written by The Brieftide · Source: arXiv