
Gemini 3 Flash tops GPT-5 mini and DeepSeek Chat 3.2 on Scrum Qs

Three LLMs answered 993 PSM I–style questions; Gemini 3 Flash had the highest accuracy, while multi-select and True/False questions were the most error-prone.

The Brieftide

TL;DR

  • Three LLMs answered 993 PSM I–style questions; Gemini 3 Flash had the highest accuracy, while multi-select and True/False questions were the most error-prone.
  • The paper tests accuracy, intra-model stability and error patterns using zero-shot, chain-of-thought and source-grounded prompting, with repeated executions to gauge variability.
  • The dataset and setup target certification-style items where strict adherence to normative Scrum definitions, roles, artifacts and rules matters.

Three large language models, GPT-5 mini, Gemini 3 Flash and DeepSeek Chat 3.2, were evaluated on 993 Scrum certification-style questions in a study submitted to arXiv on 29 Jun 2026 (arXiv:2607.00048). The paper tests accuracy, intra-model stability and error patterns using zero-shot, chain-of-thought and source-grounded prompting, with repeated executions to gauge variability.

What did the paper test and how?

The authors evaluated the three contemporary LLMs on 993 questions aligned with the Professional Scrum Master I (PSM I) assessment format, applying three prompting strategies (zero-shot, chain-of-thought and source-grounded) and repeating runs to measure intra-model stability. The dataset and setup target certification-style items where strict adherence to normative Scrum definitions, roles, artifacts and rules matters.
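To make "repeating runs to measure intra-model stability" concrete, here is a minimal Python sketch of one way such a measurement could be computed. The brief does not say which stability metric the authors used, so the two shown here (the spread of per-run accuracy and the share of questions answered identically in every run) are assumptions, as are the function names and the toy data.

```python
from statistics import mean, pstdev


def run_accuracy(answers, key):
    """Fraction of questions answered correctly in a single run."""
    return mean(1.0 if answers[q] == key[q] else 0.0 for q in key)


def stability(runs, key):
    """Summarise repeated runs of one model on the same question set."""
    accs = [run_accuracy(r, key) for r in runs]
    # A question counts as stable if every run gave the same answer,
    # whether that answer was right or wrong.
    agreement = mean(1.0 if len({r[q] for r in runs}) == 1 else 0.0 for q in key)
    return {"mean_acc": mean(accs), "acc_std": pstdev(accs), "agreement": agreement}


# Toy data invented for illustration; these are NOT results from the paper.
key = {"q1": "A", "q2": "C", "q3": "True", "q4": "B"}
runs = [
    {"q1": "A", "q2": "C", "q3": "False", "q4": "B"},
    {"q1": "A", "q2": "C", "q3": "False", "q4": "B"},
    {"q1": "A", "q2": "D", "q3": "False", "q4": "B"},
]
summary = stability(runs, key)
print(summary)
```

In this toy case q3 is answered wrongly but identically in every run, which is the kind of reproducible error that low intra-model variability would imply.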

The study's experimental design measured model accuracy across question formats and Scrum topics, and complemented quantitative results with a qualitative analysis of recurring error patterns in incorrect answers. The paper lists eight authors: Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio and Angelo Perkusich.

How did the models perform across questions and topics?

Gemini 3 Flash achieved the highest accuracy, GPT-5 mini ranked second, and DeepSeek Chat 3.2 ranked third, while intra-model variability remained low across all prompting conditions. By question format, the models were most accurate on single-answer multiple-choice items and more error-prone on multi-select and True/False questions.
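A per-format breakdown like the one described above could be computed along these lines. This is a hedged sketch: the scoring rule (exact match on the full set of selected options), the format labels and the sample items are all assumptions made for illustration, since the brief does not describe how the authors graded multi-select items.

```python
from collections import defaultdict


def is_correct(predicted, expected):
    """Assumed exact-match rule: the selected option set must equal the key."""
    return set(predicted) == set(expected)


def accuracy_by_format(items):
    """Group graded items by question format and return accuracy per format."""
    totals = defaultdict(lambda: [0, 0])  # format -> [correct, seen]
    for item in items:
        bucket = totals[item["format"]]
        bucket[0] += is_correct(item["predicted"], item["expected"])
        bucket[1] += 1
    return {fmt: correct / seen for fmt, (correct, seen) in totals.items()}


# Hypothetical items, invented for illustration; not data from the study.
items = [
    {"format": "single", "predicted": ["B"], "expected": ["B"]},
    {"format": "single", "predicted": ["A"], "expected": ["A"]},
    {"format": "multi", "predicted": ["A", "C"], "expected": ["A", "C", "D"]},
    {"format": "multi", "predicted": ["B", "D"], "expected": ["D", "B"]},
    {"format": "true_false", "predicted": ["True"], "expected": ["False"]},
]
by_format = accuracy_by_format(items)
print(by_format)
```

Under an exact-match rule, a multi-select answer that misses one option scores the same as a fully wrong one, so that format is graded more strictly than single-answer items. Whether this contributes to the gap the paper reports is not stated in the source.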

Performance varied by Scrum topic. The models were more consistent in normatively explicit areas such as Artifacts, Empiricism and Product Value, but showed fragility in Scrum Values, Self-Managing Teams, and Stakeholders & Customers. The paper emphasizes that errors were systematic rather than random and identifies recurring patterns including overgeneralization, restrictive wording, compound distractors, and conflicts between common market interpretations and strict Scrum definitions.

Why it matters

Certification-style questions probe precise, normative knowledge rather than loose factual recall, so systematic LLM errors on those items expose where model outputs diverge from formal definitions. That divergence matters for organizations using LLMs for exam preparation, automated assessment or decision support in governed practices: consistent error patterns are easier to diagnose than random mistakes, but they also recur every time the same kind of question is asked.

The low intra-model variability reported by the authors indicates reproducible behaviours across repeated executions, which makes both the strengths and the weaknesses of each model predictable in applied settings.

What to watch

Look for follow-up studies that publish numeric accuracy results per model and per question format, or that extend the 993-item PSM I–style set to other certification bodies and normative domains. A concrete next signal will be papers or benchmarks that quantify the same models' performance on expanded Scrum-topic splits or report remediation strategies for the systematic error types this study identifies.

Model ranking and key notes from the study
Model | Rank | Key notes
Gemini 3 Flash | 1 | Highest accuracy; low intra-model variability
GPT-5 mini | 2 | Second-highest accuracy; tested with zero-shot, chain-of-thought and source-grounded prompts
DeepSeek Chat 3.2 | 3 | Lowest accuracy among the three; systematic errors like overgeneralization noted

Written by The Brieftide · Source: arXiv
