Multimodal AI · July 2, 2026 · 4 min read

Gemini 3 Flash tops GPT-5 mini and DeepSeek Chat 3.2 on Scrum Qs

Three LLMs answered 993 PSM I–style questions; Gemini 3 Flash had the highest accuracy, while multi-select and True/False questions were the most error-prone.

The Brieftide · July 2, 2026

TL;DR

01 Three LLMs answered 993 PSM I–style questions; Gemini 3 Flash had the highest accuracy, while multi-select and True/False questions were the most error-prone.
02 The paper tests accuracy, intra-model stability and error patterns using zero-shot, chain-of-thought and source-grounded prompting, with repeated executions to gauge variability.
03 The dataset and setup target certification-style items where strict adherence to normative Scrum definitions, roles, artifacts and rules matters.

Three large language models, GPT-5 mini, Gemini 3 Flash and DeepSeek Chat 3.2, were evaluated on 993 Scrum certification-style questions in a study submitted to arXiv on 29 Jun 2026 (arXiv:2607.00048). The paper tests accuracy, intra-model stability and error patterns using zero-shot, chain-of-thought and source-grounded prompting, with repeated executions to gauge variability.

What did the paper test and how?

The authors evaluated the three contemporary LLMs on 993 questions aligned with the Professional Scrum Master I (PSM I) assessment format, applying three prompting strategies (zero-shot, chain-of-thought and source-grounded) and repeating runs to measure intra-model stability. The dataset and setup target certification-style items where strict adherence to normative Scrum definitions, roles, artifacts and rules matters.
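To make "accuracy" and "intra-model stability" concrete, here is a minimal sketch of how the two could be computed from repeated runs. It is not the authors' code: the exact-match scoring rule, the modal-agreement stability measure, the helper names (ans, accuracy, stability) and the toy answers are all assumptions made for illustration.

```python
from collections import Counter


def ans(*options):
    """Represent an answer as a hashable set of selected options."""
    return frozenset(options)


def accuracy(run, answer_key):
    """Share of questions whose selected options exactly match the key.

    Assumption: multi-select items get no partial credit.
    """
    hits = sum(1 for qid, gold in answer_key.items() if run.get(qid) == gold)
    return hits / len(answer_key)


def stability(runs):
    """Mean share of runs that agree with the most common answer per question.

    Assumption: this is one plausible stability measure, not the paper's.
    """
    shares = []
    for qid in runs[0]:
        answers = [run[qid] for run in runs]
        modal_count = Counter(answers).most_common(1)[0][1]
        shares.append(modal_count / len(runs))
    return sum(shares) / len(shares)


# Made-up toy data: three questions, three repeated runs of one model.
answer_key = {"q1": ans("B"), "q2": ans("A", "C"), "q3": ans("True")}
runs = [
    {"q1": ans("B"), "q2": ans("A", "C"), "q3": ans("False")},
    {"q1": ans("B"), "q2": ans("A"), "q3": ans("False")},
    {"q1": ans("B"), "q2": ans("A", "C"), "q3": ans("True")},
]

per_run_accuracy = [accuracy(run, answer_key) for run in runs]
print("accuracy per run:", per_run_accuracy)
print("stability:", stability(runs))
```

A stability value of 1.0 under this measure would mean the model gave the same answer to every question on every run, whether or not that answer was correct, which is why stability and accuracy are reported separately.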

The study's experimental design measured model accuracy across question formats and Scrum topics, and complemented quantitative results with a qualitative analysis of recurring error patterns in incorrect answers. The paper lists eight authors: Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio and Angelo Perkusich.

How did the models perform across questions and topics?

Gemini 3 Flash achieved the highest accuracy, GPT-5 mini ranked second, and DeepSeek Chat 3.2 ranked third, while intra-model variability remained low across all prompting conditions. By question format, the models were most accurate on single-answer multiple-choice items and more error-prone on multi-select and True/False questions.

Performance varied by Scrum topic. The models were more consistent in normatively explicit areas such as Artifacts, Empiricism and Product Value, but showed fragility in Scrum Values, Self-Managing Teams, and Stakeholders & Customers. The paper emphasizes that errors were systematic rather than random and identifies recurring patterns including overgeneralization, restrictive wording, compound distractors, and conflicts between common market interpretations and strict Scrum definitions.
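A breakdown by question format or Scrum topic amounts to grouping scored answers and averaging within each group. The sketch below shows one way to do that; the record layout, the accuracy_by helper and the six sample records are hypothetical and do not reproduce any figures from the study.

```python
from collections import defaultdict


def accuracy_by(records, field):
    """Return accuracy per distinct value of `field` (e.g. format or topic)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, seen]
    for rec in records:
        bucket = totals[rec[field]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {group: correct / seen for group, (correct, seen) in totals.items()}


# Made-up scored answers for illustration only.
records = [
    {"format": "single-answer", "topic": "Artifacts", "correct": True},
    {"format": "single-answer", "topic": "Scrum Values", "correct": True},
    {"format": "multi-select", "topic": "Artifacts", "correct": True},
    {"format": "multi-select", "topic": "Scrum Values", "correct": False},
    {"format": "true/false", "topic": "Artifacts", "correct": True},
    {"format": "true/false", "topic": "Scrum Values", "correct": False},
]

by_format = accuracy_by(records, "format")
by_topic = accuracy_by(records, "topic")
print(by_format)
print(by_topic)
```

Keeping the grouping field as a parameter lets the same function produce both the per-format and the per-topic view from one set of scored answers.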

Why it matters

Certification-style questions probe precise, normative knowledge rather than loose factual recall, so systematic LLM errors on those items expose where model outputs diverge from formal definitions. That divergence matters for organizations using LLMs for exam preparation, automated assessment or decision support in governed practices, because consistent error patterns are easier to diagnose but harder to mask than random mistakes.

The low intra-model variability reported by the authors indicates reproducible behaviours across repeated executions, which makes both the strengths and the weaknesses of each model predictable in applied settings.

What to watch

Look for follow-up studies that publish numeric accuracy results per model and per question format, or that extend the 993-item PSM I–style set to other certification bodies and normative domains. A concrete next signal will be papers or benchmarks that quantify the same models' performance on expanded Scrum-topic splits or report remediation strategies for the systematic error types this study identifies.

Model ranking and key notes from the study

Model	Rank	Key notes
Gemini 3 Flash	1	Highest accuracy; low intra-model variability
GPT-5 mini	2	Second-highest accuracy; tested with zero-shot, chain-of-thought and source-grounded prompts
DeepSeek Chat 3.2	3	Lowest accuracy among the three; systematic errors like overgeneralization noted

Written by The Brieftide · Source: arXiv
