SciRisk-Bench: Benchmarking AI4Science safety across 10 risks
SciRisk-Bench evaluates mainstream and science-oriented LLMs across 7 disciplines, 31 subdisciplines and 10 explicit risk dimensions.
TL;DR
- SciRisk-Bench evaluates mainstream and science-oriented LLMs across 7 disciplines, 31 subdisciplines and 10 explicit risk dimensions.
- SciRisk-Bench is a new benchmark for AI4Science safety, submitted to arXiv on 17 Jun 2026 as arXiv:2606.18936 by Linghao Feng and 10 coauthors.
- The benchmark measures both scientific competence and whether models recognize and avoid risks across 7 disciplines, 31 subdisciplines and 10 risk dimensions.
SciRisk-Bench is a new benchmark for AI4Science safety, submitted to arXiv on 17 Jun 2026 as arXiv:2606.18936 by Linghao Feng and 10 coauthors. The benchmark measures both scientific competence and whether models recognize and avoid risks across 7 disciplines, 31 subdisciplines and 10 risk dimensions.
What is SciRisk-Bench and what does it measure?
SciRisk-Bench is a risk-dimension-aware benchmark that evaluates AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. The paper says the benchmark covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. It is designed to test not only task performance but also model behavior in high-stakes scientific contexts, including whether models recognize and avoid risks.
How was the benchmark organized and who made it?
The authors, led by Linghao Feng with coauthors Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao and Yi Zeng, built SciRisk-Bench to span multiple scientific areas and explicit risk categories. The arXiv submission metadata lists the title as "SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety" and gives the submission date as 17 Jun 2026 (arXiv:2606.18936). The paper organizes the benchmark along two axes, explicit risk dimensions and scientific disciplines, so that a model can be evaluated on either axis or on both together.
How were models evaluated on SciRisk-Bench?
The experimental section of the paper evaluates both mainstream LLMs and science-oriented LLMs across the benchmark's risk dimensions, disciplines and subdisciplines, enabling fine-grained diagnosis of unsafe behaviors. The authors state that the benchmark enables analysis "across risk dimensions, disciplines, and sub-disciplines," which supports detailed breakdowns of where models perform safely or unsafely. The submission includes links to the paper PDF and TeX source on arXiv and lists ancillary metadata such as an arXiv-issued DOI via DataCite (pending registration).
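To illustrate what a breakdown "across risk dimensions, disciplines, and sub-disciplines" could look like in practice, here is a minimal sketch that aggregates per-item safety judgments along one axis at a time. Everything in it is an assumption for illustration: the record layout, the `safe_rate_by` helper and the placeholder labels such as `risk_A` and `discipline_1` are hypothetical and are not taken from the paper, its data format or its category names.

```python
from collections import defaultdict

# Hypothetical per-item results: (risk_dimension, discipline, subdiscipline, is_safe).
# All labels are placeholders, not the benchmark's actual category names.
records = [
    ("risk_A", "discipline_1", "sub_1a", True),
    ("risk_A", "discipline_1", "sub_1b", False),
    ("risk_B", "discipline_1", "sub_1a", True),
    ("risk_B", "discipline_2", "sub_2a", False),
    ("risk_A", "discipline_2", "sub_2a", True),
]

# Position of each axis inside a record tuple (an assumed layout).
AXES = {"risk": 0, "discipline": 1, "subdiscipline": 2}


def safe_rate_by(records, axis):
    """Return {label: fraction of items judged safe} for one axis."""
    idx = AXES[axis]
    totals = defaultdict(int)
    safe = defaultdict(int)
    for rec in records:
        label = rec[idx]
        totals[label] += 1
        safe[label] += int(rec[3])  # True counts as 1, False as 0
    return {label: safe[label] / totals[label] for label in totals}


by_risk = safe_rate_by(records, "risk")
by_discipline = safe_rate_by(records, "discipline")
by_subdiscipline = safe_rate_by(records, "subdiscipline")
```

Grouping the same records by a different axis is what would let a reader see, for example, whether unsafe answers cluster in one risk dimension or in one discipline. The paper's actual scoring method may differ from this simple safe-fraction.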
Why it matters
AI models are being embedded into AI4Science workflows for literature analysis, laboratory planning and autonomous discovery, according to the paper. A benchmark that explicitly measures whether models recognize and avoid risks addresses a gap the authors identify in existing AI4Science safety datasets, which they say leave underlying risk dimensions underspecified. By codifying 10 risk dimensions and mapping them across 7 disciplines and 31 subdisciplines, SciRisk-Bench provides a structured way to find where scientific LLMs remain unsafe.
What to watch
The immediate next signal is the paper's experimental section, which evaluates mainstream and science-oriented LLMs across the benchmark's axes. Also look for any associated code or data releases linked from the arXiv entry, and for follow-up work that applies SciRisk-Bench to more model families or expands the 7-discipline, 31-subdiscipline matrix.
References and source details: the work appears on arXiv as arXiv:2606.18936, submitted 17 Jun 2026, by Linghao Feng et al.; the abstract and submission metadata state the benchmark covers 7 disciplines, 31 subdisciplines and 10 risk dimensions and evaluates both mainstream and science-oriented LLMs.
Written by The Brieftide · Source: arXiv