X+Slides: Audience-Conditioned Slide Benchmark and Results
X+Slides evaluates slide generation across 113 topics, seven presentation scenes and 8,133 audience-weighted, source-grounded probes.
TL;DR
- X+Slides evaluates slide generation across 113 topics, seven presentation scenes and 8,133 audience-weighted, source-grounded probes.
- The paper defines four evaluation metrics and reports model scores at tau_A=0.7 for DeepPresenter, SlideTailor and a NotebookLM ablation.
- The benchmark focuses on audience utility rather than only slide completeness or visual quality.
X+Slides, a benchmark submitted to arXiv on 17 Jun 2026 by Haodong Chen and eight co-authors, evaluates audience-conditioned slide generation across 113 topics, seven presentation scenes and 8,133 deduplicated, source-grounded probes. The paper defines four evaluation metrics and reports model scores at tau_A=0.7 for DeepPresenter, SlideTailor and a NotebookLM ablation.
What is X+Slides and how is it built?
X+Slides is a benchmark that measures how well generated slide decks match the information needs of different audiences. It is built from a corpus spanning 113 topics, seven presentation scenes and 8,133 deduplicated, source-grounded probes. The authors assign audience-specific utility weights to the same probes and compute four complementary metrics: Audience Coverage, Domain-wise Coverage, Efficiency and Correctness. They describe the evaluation framework as dynamic and source-grounded.
The benchmark focuses on audience utility rather than only slide completeness or visual quality. Probes are grounded in the original sources, which lets X+Slides verify whether slide claims are supported by the source material rather than inferred or hallucinated.
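To make the weighting idea concrete, here is a minimal Python sketch of how audience-specific utility weights could turn per-probe hits into a single coverage score. It is an illustration under stated assumptions, not the paper's formula: the `Probe` class, the `weighted_coverage` function, the example questions and the weights are all hypothetical, and the score is assumed to be the answered share of total weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    probe_id: str
    question: str
    weight: float  # hypothetical utility weight for one audience, in [0, 1]


def weighted_coverage(probes: list[Probe], answered: set[str]) -> float:
    """Share of total audience utility carried by probes the deck answers.

    Assumption: coverage = answered weight / total weight. The brief does not
    give the exact formula used by X+Slides.
    """
    total = sum(p.weight for p in probes)
    if total == 0:
        return 0.0
    hit = sum(p.weight for p in probes if p.probe_id in answered)
    return hit / total


# Illustrative probes for a single audience (made-up data).
probes = [
    Probe("p1", "What is the main result?", 1.0),
    Probe("p2", "Which dataset was used?", 0.5),
    Probe("p3", "What is the proof technique?", 0.5),
]

# The deck answers p1 and p2 but omits p3.
print(weighted_coverage(probes, {"p1", "p2"}))  # 0.75
```

Because the score is weight-based, omitting a low-weight probe costs less than omitting a high-weight one, which is the behaviour an unweighted completeness check cannot express.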
How do current slide-generation systems perform on X+Slides?
At tau_A=0.7 the benchmark shows varied performance: DeepPresenter reaches an Audience Coverage of 0.714, SlideTailor reaches 0.594 and the NotebookLM ablation reaches 0.853. Those scores indicate that, as the paper puts it, "current systems can recover a substantial but still incomplete part of audience-essential information."
Beyond a single metric, X+Slides reports Domain-wise Coverage to show which information types are covered, Efficiency to measure delivered utility per unit of attention cost, and Correctness to verify grounding to the source. The authors use these complementary measures to highlight differences that visual quality or broad topical coverage alone would not reveal; they note that neither should be taken as evidence of source support without source-grounded evaluation.
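The three complementary measures described above can be sketched in a few lines of Python. This is a hedged illustration, not the benchmark's implementation: the record layout, the function names, the use of slide count as the attention cost, and the choice to count every answered probe as delivered utility are all assumptions made for the example.

```python
from collections import defaultdict

# One record per probe judged against a deck (made-up data).
# Hypothetical layout: (domain, weight, answered, supported_by_source)
judged = [
    ("method", 1.0, True, True),
    ("method", 0.5, False, False),
    ("results", 1.0, True, True),
    ("results", 0.5, True, False),  # answered, but the claim is not in the source
]


def domain_coverage(records):
    """Weighted coverage computed separately for each information type."""
    hit, total = defaultdict(float), defaultdict(float)
    for domain, weight, answered, _ in records:
        total[domain] += weight
        if answered:
            hit[domain] += weight
    return {d: hit[d] / total[d] for d in total if total[d] > 0}


def efficiency(records, attention_cost):
    """Delivered utility per unit of attention cost.

    Assumption: attention cost is the number of slides in the deck.
    """
    delivered = sum(w for _, w, answered, _ in records if answered)
    return delivered / attention_cost


def correctness(records):
    """Fraction of answered probes whose answer is supported by the source."""
    answered = [r for r in records if r[2]]
    if not answered:
        return 0.0
    return sum(1 for r in answered if r[3]) / len(answered)


print(domain_coverage(judged))
print(efficiency(judged, attention_cost=10))  # 0.25
print(correctness(judged))
```

In this toy data the deck covers the "results" domain fully yet one of its answers is unsupported, so coverage and correctness diverge, which is the kind of gap a single headline score would hide.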
Why does audience conditioning change evaluation?
Audience conditioning forces evaluation to weight the same source content differently depending on who will view the slides: specialists may need rigorous proofs, for example, while decision-makers look for actionable conclusions. X+Slides operationalizes this by assigning audience-specific utility weights to probes. Slide generation thus becomes a conditional task rather than a one-size-fits-all one: the same source-grounded probe can carry different importance, which produces different utility scores and trade-offs in Coverage, Efficiency and Correctness.
That design exposes where systems prioritize visual or topical breadth over source-backed utility. The paper’s experiments show concrete numeric gaps at tau_A=0.7 across models, which the authors use to argue for source-grounded, audience-aware evaluation rather than relying solely on perceived visual quality or generic coverage metrics.
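A small Python sketch shows how one deck can score differently for two audiences. Note the main assumption: this brief does not define tau_A, so the sketch treats it as a cutoff on utility weight that marks a probe as "audience-essential". The probe ids, weights, audience names and the `essential_coverage` function are hypothetical.

```python
# Hypothetical weights: probe id -> utility weight per audience, in [0, 1].
weights = {
    "proof_sketch":   {"specialist": 0.9, "decision_maker": 0.2},
    "cost_estimate":  {"specialist": 0.3, "decision_maker": 0.9},
    "key_conclusion": {"specialist": 0.8, "decision_maker": 1.0},
}


def essential_coverage(weights, answered, audience, tau=0.7):
    """Fraction of audience-essential probes that the deck answers.

    Assumption: a probe is essential for an audience when its weight for
    that audience is at least tau. This reading of tau_A is not confirmed
    by the source.
    """
    essential = {pid for pid, w in weights.items() if w[audience] >= tau}
    if not essential:
        return 0.0
    return len(essential & answered) / len(essential)


# The same deck, judged once per audience.
deck_answers = {"proof_sketch", "key_conclusion"}
for audience in ("specialist", "decision_maker"):
    print(audience, essential_coverage(weights, deck_answers, audience))
# specialist 1.0
# decision_maker 0.5
```

The deck is unchanged between the two calls; only the set of probes that count as essential moves, so the omitted cost estimate penalizes the decision-maker score and leaves the specialist score untouched.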
What to watch
Watch whether future evaluations adopt audience-weighted, source-grounded probes, and whether developers publish more source-grounded datasets or ablations comparable to the NotebookLM result. A concrete next signal will be external reproductions reporting Audience Coverage at tau_A=0.7 for other systems, or extensions of X+Slides that add new presentation scenes or audience types.
| Item | Value |
|---|---|
| Topics covered | 113 |
| Presentation scenes | 7 |
| Deduplicated, source-grounded probes | 8,133 |
| Audience Coverage, DeepPresenter (tau_A=0.7) | 0.714 |
| Audience Coverage, SlideTailor (tau_A=0.7) | 0.594 |
| Audience Coverage, NotebookLM ablation (tau_A=0.7) | 0.853 |
Written by The Brieftide · Source: arXiv