LLM evaluation: 4 methods including MMLU and verifiers
Sebastian Raschka lays out four evaluation approaches: multiple‑choice benchmarks (MMLU), verifiers, leaderboards, and LLM judges.
TL;DR
- Sebastian Raschka lays out four evaluation approaches: multiple‑choice benchmarks (MMLU), verifiers, leaderboards, and LLM judges.
- The approaches appear in a long article published Oct 05, 2025.
- The article sorts the four categories into two groups, benchmark‑based evaluation and judgment‑based evaluation, and includes from‑scratch code examples (including a Qwen3 0.6B implementation).
Sebastian Raschka presents four main approaches to LLM evaluation in a long article published Oct 05, 2025: multiple‑choice benchmarks, verifiers, leaderboards, and LLM judges. The article frames the four categories into two groups, benchmark‑based evaluation and judgment‑based evaluation, and includes from‑scratch code examples (including a Qwen3 0.6B implementation).
The four approaches, at a glance
Multiple‑choice benchmarks measure answer‑choice accuracy by comparing a model's predicted letter to the dataset's correct answer. Raschka uses MMLU as a representative example, noting MMLU consists of 57 subjects and about 16 thousand multiple‑choice questions in total, and that performance is measured in terms of accuracy (the fraction of correctly answered questions). He shows a sample MMLU prompt format that ends with "Answer: " to encourage a single‑letter next token, and demonstrates a scoring routine that extracts the first A/B/C/D letter the model prints. In one example run on the high_school_mathematics subset a generated response produced "Generated letter: C" and "Correct? False". Raschka also notes that, assuming equal answer probability, a random guesser is expected to achieve 25%.
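The scoring loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Raschka's code: the helper names `format_prompt`, `extract_letter`, and `accuracy`, the exact prompt layout, and the sample question are all assumptions made for the example. Only the trailing "Answer: " and the first-letter extraction rule come from the article as summarized here.

```python
CHOICES = "ABCD"  # MMLU questions have four answer options


def format_prompt(question, options):
    """Build a multiple-choice prompt that ends with 'Answer: ' (assumed layout)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer: ")
    return "\n".join(lines)


def extract_letter(response):
    """Return the first A/B/C/D character in the response, or None.

    Note: this naive scan would also pick up the 'A' in a word like 'Answer',
    which is one reason the prompt nudges the model toward a single letter.
    """
    for ch in response:
        if ch in CHOICES:
            return ch
    return None


def accuracy(responses, gold):
    """Fraction of responses whose extracted letter matches the gold letter."""
    correct = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Hypothetical question and model outputs, for illustration only.
prompt = format_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
responses = [" B", "C", "D"]
gold = ["B", "C", "A"]
score = accuracy(responses, gold)       # 2 of 3 correct
random_baseline = 1 / len(CHOICES)      # 0.25 with four equally likely options
```

Comparing `score` against `random_baseline` is the sanity check the 25% figure implies: a model that scores near 0.25 on four-option questions is doing no better than guessing.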
Verifiers are presented as a separate method and are a focus of Raschka's upcoming book Build a Reasoning Model (From Scratch). The book takes a hands‑on approach to building a reasoning LLM and, Raschka says, focuses more on verifier‑based evaluation than this article does. The article positions verifiers as one of the four practical evaluation patterns practitioners use, and ties verifier discussion to the author's broader work on building and validating reasoning models.
Leaderboards and model cards are the third approach Raschka lists, the sort of public comparisons that research papers, marketing materials, technical reports, and model cards commonly include. He treats these as another standard evaluation channel used in practice, alongside multiple‑choice and verifier methods.
LLM judges, the fourth approach, are described as part of the judgment‑based evaluation group. The article groups LLM judges with leaderboards under judgment‑based approaches and contrasts them with benchmark‑based approaches, which cover multiple‑choice datasets and verifiers.
Example: a from‑scratch Qwen3 0.6B demonstration
Raschka walks through code that loads a from‑scratch Qwen3 0.6B model implemented in pure PyTorch. He notes the small Qwen3 implementation requires only about 1.5 GB of RAM. The example shows how to format MMLU prompts, tokenize and tensorize them, and then generate and parse token output to pull out the model's chosen answer letter. The article includes code blocks for loading the model, formatting prompts, and a predict_choice function that iterates generated tokens to find the first A/B/C/D letter.
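The token-by-token parsing step can be sketched without loading a model. The function name `predict_choice` comes from the article, but the signature below, the `max_new_tokens` cap, and the punctuation stripping are assumptions for illustration. A plain list of decoded token strings stands in for the model's generation loop.

```python
from typing import Iterable, Optional

VALID_LETTERS = {"A", "B", "C", "D"}


def predict_choice(token_stream: Iterable[str], max_new_tokens: int = 8) -> Optional[str]:
    """Scan decoded tokens in order and return the first one that is an answer letter.

    `token_stream` is a hypothetical stand-in for the decoded tokens a model
    would emit one at a time; a real implementation would decode token IDs.
    """
    for step, token_text in enumerate(token_stream):
        if step >= max_new_tokens:
            break  # stop scanning once the generation budget is used up
        # Drop surrounding whitespace and common punctuation, e.g. " C." -> "C"
        candidate = token_text.strip().strip(".:()")
        if candidate in VALID_LETTERS:
            return candidate
    return None  # the model never produced a standalone answer letter


# Hypothetical decoded outputs, for illustration only.
picked = predict_choice([" The", " answer", " is", " C", "."])
missing = predict_choice([" I", " am", " not", " sure"])
```

Matching whole tokens rather than single characters avoids mistaking the "A" in a word such as "Answer" for a choice, at the cost of depending on how the tokenizer splits the output.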
Raschka also points readers to additional code on GitHub and to related material, including an earlier from‑scratch Qwen3 walkthrough. He mentions Build a Reasoning Model (From Scratch) is in early access with more than 100 pages already online and another 30 pages being added by the layout team.
Why it matters
Distinguishing these four approaches clarifies what an evaluation number actually measures: multiple‑choice accuracy quantifies knowledge recall on a fixed set of questions, while the other approaches apply different criteria that do not always reduce to a single standardized metric. That distinction matters when choosing models, designing fine‑tuning workflows, or interpreting reported results in papers and model cards. Raschka's inclusion of runnable, from‑scratch examples makes the tradeoffs tangible: choices in problem framing, such as prompt format and scoring logic, can materially change measured performance.
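To make the point about scoring logic concrete, here is a small sketch in which the same model outputs receive two different accuracy numbers depending on how the answer letter is extracted. Both scorers, the responses, and the gold labels are hypothetical and chosen only to show the effect. Neither scorer is taken from the article.

```python
import re


def score_strict(response):
    """Accept a response only if its first non-space character is A-D."""
    text = response.strip()
    return text[0] if text and text[0] in "ABCD" else None


def score_lenient(response):
    """Accept the first standalone capital A-D anywhere in the response."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None


def accuracy(responses, gold, scorer):
    """Fraction of responses the given scorer maps to the gold letter."""
    return sum(scorer(r) == g for r, g in zip(responses, gold)) / len(gold)


# Hypothetical outputs: two terse answers and two wrapped in a sentence.
responses = ["C", " B", "The answer is D", "I think A"]
gold = ["C", "B", "D", "A"]

strict_acc = accuracy(responses, gold, score_strict)    # 0.5
lenient_acc = accuracy(responses, gold, score_lenient)  # 1.0
```

The model's answers are identical in both runs. Only the extraction rule changed, which is why a reported accuracy is hard to interpret without knowing the prompt format and scoring routine behind it.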
What to watch
Look for the author's follow‑up material and code updates tied to Build a Reasoning Model (From Scratch) and the linked GitHub repositories, where additional verifier‑based evaluation examples and expanded from‑scratch implementations are being added.
| Approach | What it measures | Example or context | Metric |
|---|---|---|---|
| Multiple-choice benchmarks | Answer-choice accuracy | MMLU (57 subjects, about 16 thousand questions) | Accuracy (fraction correct); random guess ~25% |
| Verifiers | Verifier-based evaluation | Focus of Build a Reasoning Model (From Scratch) | Varies by verifier |
| Leaderboards | Public comparative rankings | Used in research papers, model cards, technical reports | Varies by leaderboard |
| LLM judges | Judgment-based evaluation | Grouped with leaderboards as judgment-based | Varies by judge criteria |
Written by The Brieftide · Source: Ahead of AI