Benchmarks & Evals · December 9, 2025 · 3 min read

DeepMind FACTS Benchmark Suite: evaluating LLM factuality

DeepMind released the FACTS Benchmark Suite to measure and categorize factual errors in large language models with standardized tests.

The Brieftide · December 9, 2025

TL;DR

01 DeepMind released the FACTS Benchmark Suite to measure and categorize factual errors in large language models with standardized tests.
02 DeepMind released the FACTS Benchmark Suite today, a standardized set of tests and datasets designed to measure factuality in large language models.
03 The suite bundles task templates, reference annotations and an evaluation harness intended to surface specific error modes such as fabrications, temporal inconsistencies and misattribution.

DeepMind released the FACTS Benchmark Suite today, a standardized set of tests and datasets designed to measure factuality in large language models. The suite bundles task templates, reference annotations and an evaluation harness intended to surface specific error modes such as fabrications, temporal inconsistencies and misattribution.

The FACTS release targets systematic assessment rather than single-number scores. It groups checks by error type, provides controlled prompt templates and includes both automatic checks and human-annotated references to validate model outputs. DeepMind positions FACTS as a diagnostic complement to existing benchmarks, focusing on where models produce incorrect or unverifiable statements and how those failures vary by prompt and context.

What the suite contains

FACTS is organized around a small set of core measurement dimensions. Each dimension contains multiple task templates and example prompts to exercise the same underlying capability in varied ways. The public release includes the following elements:

Task templates and prompt families, so evaluators can rerun tests with consistent input structure.
Reference answers and human annotations for a subset of items, enabling calibration of automatic metrics.
An evaluation harness that computes per-item metrics and aggregates by error type rather than producing a single aggregate score.
Guidance on reproducible evaluation, including random-seed recommendations and instructions for handling model nondeterminism.
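The per-error-type aggregation described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the FACTS harness itself: the record fields (`item_id`, `error_type`, `seed`, `correct`) and the function name are assumptions made for the example, and the sample results are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-item results from two seeded runs.
# Field names are illustrative and are not taken from the FACTS release.
results = [
    {"item_id": "q1", "error_type": "temporal", "seed": 0, "correct": True},
    {"item_id": "q1", "error_type": "temporal", "seed": 1, "correct": False},
    {"item_id": "q2", "error_type": "fabrication", "seed": 0, "correct": False},
    {"item_id": "q2", "error_type": "fabrication", "seed": 1, "correct": False},
    {"item_id": "q3", "error_type": "attribution", "seed": 0, "correct": True},
    {"item_id": "q3", "error_type": "attribution", "seed": 1, "correct": True},
]


def aggregate_by_error_type(records):
    """Group per-item outcomes by error type instead of pooling them into one score."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["error_type"]].append(1.0 if record["correct"] else 0.0)
    return {
        error_type: {"n": len(scores), "accuracy": mean(scores)}
        for error_type, scores in buckets.items()
    }


report = aggregate_by_error_type(results)
for error_type, stats in sorted(report.items()):
    print(f"{error_type}: n={stats['n']} accuracy={stats['accuracy']:.2f}")
```

Keeping the seed on every record means the same table can later be split by run to check how much of a score difference is due to model nondeterminism.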

DeepMind emphasizes diagnostic clarity: users can trace a low score to a particular failure mode, for example confusion about dates, invented entities, or incorrect source attribution. The suite also includes adversarial-style prompt variants to test robustness under phrasing changes.
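One simple way to quantify robustness under phrasing changes is to ask the same question in several paraphrased forms and measure how often the answers agree. The sketch below assumes the model's answers have already been collected as strings; the normalization rule and the `paraphrase_consistency` name are illustrative choices, not part of the FACTS release.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding spaces and periods."""
    return " ".join(answer.lower().split()).strip(" .")


def paraphrase_consistency(answers):
    """Share of answers that match the most common normalized answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)


# Invented answers to four paraphrases of one question.
answers = ["1969", "1969.", " 1969 ", "1968"]
score = paraphrase_consistency(answers)
print(score)
```

Consistency is not the same as correctness: a model can give the same wrong answer to every paraphrase, so a score like this is only useful alongside a check against reference answers.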

How FACTS differs from prior benchmarks

FACTS focuses narrowly on factuality rather than on broad capabilities. Where prior benchmarks often collapse performance across disparate tasks into a single accuracy number, FACTS classifies errors and measures each class with multiple targeted probes. The resulting report is easier to act on: developers can see which kinds of factual mistakes a model makes most often and which prompt styles increase the risk.

The suite does not mandate a single metric. Instead it reports multiple complementary signals, including exact-match where applicable, normalized factuality scores for free-form outputs, and human disagreement rates. That design reflects DeepMind's stated goal of improving measurement fidelity for factual errors that are consequential in real-world use, such as hallucinated citations or stale knowledge about dates and events.
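Two of the signals named above are easy to illustrate. The sketch below computes a normalized exact-match check and a human disagreement rate, here defined as the fraction of items on which annotators did not all give the same label. Both definitions and the function names are assumptions for illustration; the article does not say how FACTS defines these metrics, and the normalized factuality score for free-form outputs is not attempted here.

```python
def _norm(text: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding spaces and periods."""
    return " ".join(text.lower().split()).strip(" .")


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after light normalization."""
    return _norm(prediction) == _norm(reference)


def disagreement_rate(labels_per_item):
    """Fraction of items whose annotator labels are not unanimous."""
    if not labels_per_item:
        raise ValueError("need at least one annotated item")
    disputed = sum(1 for labels in labels_per_item if len(set(labels)) > 1)
    return disputed / len(labels_per_item)


# Invented annotations: one inner list per item, one label per annotator.
annotations = [
    ["supported", "supported"],
    ["supported", "unsupported"],
    ["unsupported", "unsupported"],
    ["supported", "unsupported", "supported"],
]
rate = disagreement_rate(annotations)
print(exact_match("Paris.", "paris"), rate)
```

Reporting the disagreement rate next to an automatic score shows where the references themselves are contested, which is where an automatic metric deserves the least trust.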

Early adopters can plug FACTS into existing evaluation pipelines. DeepMind included example scripts for running tests against hosted APIs and open-source checkpoints, and documented procedures for sampling human validators when automation is insufficient.

Why it matters

FACTS supplies practitioners with a more granular toolkit for locating and quantifying factual failures in LLMs, which helps prioritize mitigation work such as retrieval augmentation or post-generation verification. By shifting evaluation toward error taxonomy and reproducible workflows, the suite could change how teams compare model safety and reliability across development cycles.

FACTS measurement dimensions

Dimension	What it checks	Example tasks
Knowledge recall	Checks factual retrieval of stable facts	Direct Q&A, cloze
Temporal consistency	Correctness vs. time-sensitive facts	Scenario updates, date-shifted prompts
Attribution and sourcing	Whether outputs correctly cite or reference sources	Citation generation, source-verification tasks
Hallucination / fabrication	Detection of invented entities or events	Open-ended generation with fact-check targets
Prompt robustness	How phrasing changes affect factual output	Paraphrases, adversarial prompt variants

Primary source

Google DeepMind

deepmind.google
