Benchmarks & Evals · October 5, 2025 · 4 min read

LLM evaluation: 4 methods including MMLU and verifiers

Sebastian Raschka lays out four evaluation approaches: multiple‑choice benchmarks (MMLU), verifiers, leaderboards, and LLM judges.

The Brieftide · October 5, 2025

TL;DR

01 Sebastian Raschka lays out four evaluation approaches: multiple‑choice benchmarks (MMLU), verifiers, leaderboards, and LLM judges.
02 Raschka presents these four approaches in a long article published Oct 05, 2025.
03 The article sorts the four categories into two groups, benchmark‑based evaluation and judgment‑based evaluation, and includes from‑scratch code examples (including a Qwen3 0.6B implementation).

The four approaches, at a glance

Multiple‑choice benchmarks measure answer‑choice accuracy by comparing a model's predicted letter to the dataset's correct answer. Raschka uses MMLU as a representative example, noting that it consists of 57 subjects and about 16 thousand multiple‑choice questions in total, and that performance is reported as accuracy (the fraction of correctly answered questions). He shows a sample MMLU prompt format that ends with "Answer: " to encourage a single‑letter next token, and demonstrates a scoring routine that extracts the first A/B/C/D letter the model prints. In one example run on the high_school_mathematics subset, the output read "Generated letter: C" and "Correct? False". Raschka also notes that, assuming each of the four answers is equally likely, a random guesser is expected to score 25%.
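The scoring loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than Raschka's code: the function names (format_prompt, extract_letter, accuracy), the exact prompt layout, and the sample question are assumptions made for the example. Only the trailing "Answer: " and the first A/B/C/D extraction rule come from the description above.

```python
import re

LETTERS = "ABCD"


def format_prompt(question, choices):
    """Build a multiple-choice prompt that ends with 'Answer: '."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer: ")
    return "\n".join(lines)


def extract_letter(generated_text):
    """Return the first A/B/C/D character in the output, or None."""
    match = re.search(r"[ABCD]", generated_text)
    return match.group(0) if match else None


def accuracy(predictions, answers):
    """Fraction of predictions that match the reference letters."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical example item (not taken from MMLU).
prompt = format_prompt("What is 7 * 8?", ["54", "56", "58", "64"])

# Stand-ins for model generations; a real run would call the model here.
outputs = [" B", "The answer is C", "no idea"]
predicted = [extract_letter(o) for o in outputs]

print(predicted)
print(accuracy(predicted, ["B", "D", "A"]))

# Expected accuracy of uniform random guessing over four options.
random_baseline = 1 / len(LETTERS)
print(random_baseline)
```

One limitation of this naive rule is that it picks up any capital A, B, C, or D, including the "A" in a word such as "Answer", so a stricter pattern may be preferable when outputs are free-form.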

Verifiers are presented as a separate method and are a focus of Raschka's upcoming book Build a Reasoning Model (From Scratch). The book takes a hands‑on approach to building a reasoning LLM and, Raschka says, focuses more on verifier‑based evaluation than this article does. The article positions verifiers as one of the four evaluation patterns practitioners use, and ties the verifier discussion to the author's broader work on building and validating reasoning models.

Leaderboards are the third approach Raschka lists: public comparative rankings of the sort that research papers, marketing materials, technical reports, and model cards commonly include. He treats them as another standard evaluation channel used in practice, alongside multiple‑choice and verifier methods.

LLM judges, the fourth approach, fall under the judgment‑based evaluation group, which the article contrasts with benchmark‑based approaches such as multiple‑choice datasets.

Example: a from‑scratch Qwen3 0.6B demonstration

Raschka walks through code that loads a from‑scratch Qwen3 0.6B model implemented in pure PyTorch. He notes the small Qwen3 implementation requires only about 1.5 GB of RAM. The example shows how to format MMLU prompts, tokenize and tensorize them, and then generate and parse token output to pull out the model's chosen answer letter. The article includes code blocks for loading the model, formatting prompts, and a predict_choice function that iterates generated tokens to find the first A/B/C/D letter.
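The token-scanning step can be illustrated without loading a model. In the sketch below, fake_token_stream is a hypothetical stand-in for the model's token-by-token generation, and the signature of predict_choice (including the max_new_tokens cap) is an assumption for illustration. It is not the function from the article, only the same idea of stopping at the first A/B/C/D letter.

```python
from typing import Iterable, Iterator, Optional


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model that yields decoded tokens one by one."""
    for piece in text.split(" "):
        yield piece + " "


def predict_choice(token_stream: Iterable[str], max_new_tokens: int = 8) -> Optional[str]:
    """Scan generated tokens in order and return the first A/B/C/D letter."""
    for step, token in enumerate(token_stream):
        if step >= max_new_tokens:
            break  # give up once the generation budget is used
        for char in token:
            if char in "ABCD":
                return char  # stop early; later tokens are never consumed
    return None


generated = predict_choice(fake_token_stream("C because 3 + 4 = 7"))
correct_answer = "B"  # hypothetical reference label

print(f"Generated letter: {generated}")
print(f"Correct? {generated == correct_answer}")
```

Because the function consumes a generator, it can return as soon as a letter appears instead of waiting for the full generation to finish.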

Raschka also points readers to additional code on GitHub and to related material, including an earlier from‑scratch Qwen3 walkthrough. He mentions Build a Reasoning Model (From Scratch) is in early access with more than 100 pages already online and another 30 pages being added by the layout team.

Why it matters

Distinguishing these four approaches clarifies what an evaluation number actually measures: multiple‑choice accuracy quantifies knowledge recall on a fixed set of questions, while verifiers and LLM judges apply their own criteria, which do not reduce to a single standardized metric. That distinction matters when choosing models, designing fine‑tuning workflows, or interpreting reported results in papers and model cards. Raschka's inclusion of runnable, from‑scratch examples makes the tradeoffs tangible: how a problem is framed (prompt format, scoring logic) can materially change measured performance.
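A toy example shows how scoring logic alone can shift a reported number. The outputs, reference answers, and both scoring rules below are invented for illustration and do not come from the article. The point is only that the same generations yield different accuracies under different parsers.

```python
import re

# Hypothetical model outputs and reference answers.
outputs = ["B", "Answer: B", "I think the answer is (C).", "D"]
answers = ["B", "B", "C", "A"]


def strict_score(text):
    """Accept only an output that is exactly one answer letter."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None


def lenient_score(text):
    """Accept the first standalone A/B/C/D anywhere in the output."""
    match = re.search(r"\b[ABCD]\b", text)
    return match.group(0) if match else None


def accuracy(scorer):
    """Fraction of outputs whose parsed letter matches the reference."""
    hits = sum(scorer(o) == a for o, a in zip(outputs, answers))
    return hits / len(answers)


print(accuracy(strict_score))   # 0.25
print(accuracy(lenient_score))  # 0.75
```

Here the model's behavior is identical in both runs, and only the parser changes.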

What to watch

Look for the author's follow‑up material and code updates tied to Build a Reasoning Model (From Scratch) and the linked GitHub repositories, where additional verifier‑based evaluation examples and expanded from‑scratch implementations are being added.

Comparison of the four LLM evaluation approaches from Raschka

Approach	What it covers	Context in the article	Metric
Multiple-choice benchmarks	Answer-choice accuracy	MMLU (57 subjects, about 16 thousand questions)	Accuracy (fraction correct); random guess ~25%
Verifiers	Verifier-based evaluation	Focus of Build a Reasoning Model (From Scratch)	Varies by verifier
Leaderboards	Public comparative rankings	Used in research papers, model cards, technical reports	Varies by leaderboard
LLM judges	Judgment-based evaluation	Contrasted with benchmark-based approaches	Varies by judge criteria

Written by The Brieftide · Source: Ahead of AI
