S1-DeepResearch-32B: claimed open-source state of the art on 20 research benchmarks
S1-DeepResearch-32B synthesizes long-horizon agent trajectories and claims SOTA among open-source models of comparable scale on 20 benchmarks across five capability dimensions.
TL;DR
- S1-DeepResearch-32B synthesizes long-horizon agent trajectories and claims SOTA among open-source models of comparable scale on 20 benchmarks across five capability dimensions.
- The approach is presented as a unified trajectory construction paradigm.
- It produces agent behaviors that include long-chain complex reasoning, report writing, file understanding and generation, and usage of external skills.
S1-DeepResearch-32B is a new long-horizon research agent framework and model submitted to arXiv on 13 Jun 2026 by Yao Dong, Xinglin Xiao, Liwei Dong, Xinlong Jin, Zhengbo Li, Heng Zhang, Duyun Wang and Nan Xu. The paper proposes a unified trajectory construction paradigm for deep research agents and reports that S1-DeepResearch-32B "achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions."
What is S1-DeepResearch and how does it work?
S1-DeepResearch is a framework that combines closed-ended question answering with open-ended exploration to generate long agentic trajectories. The paper defines its core pipeline as "graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification," and uses that pipeline to synthesize high-quality trajectories that emphasize knowledge synthesis, complex reasoning, and planning. The authors argue these trajectories cover capabilities that search-centric datasets miss, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation.
The approach is presented as a unified trajectory construction paradigm. It produces agent behaviors that include long-chain complex reasoning, report writing, file understanding and generation, and usage of external skills. The paper contrasts this with existing training datasets it describes as mostly search-oriented and focused on closed-ended QA and information localization.
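The three stages the paper names (graph-grounded task formulation, agentic trajectory rollout, multi-dimensional trajectory verification) can be pictured with a toy sketch. This is not the authors' implementation. The graph, the `Task` and `Trajectory` types, the stand-in "agent" that simply follows the known path, and the three verification checks are all hypothetical names chosen for illustration. A real system would presumably use an LLM agent and far richer checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical toy knowledge graph: (node, relation) -> neighbouring node.
Graph = Dict[Tuple[str, str], str]


@dataclass
class Task:
    question: str
    answer: str
    path: List[Tuple[str, str]]  # graph hops that ground the answer


@dataclass
class Trajectory:
    task: Task
    steps: List[str] = field(default_factory=list)
    final_answer: str = ""


def formulate_task(graph: Graph, start: str, relations: List[str]) -> Task:
    """Stage 1: walk the graph to build a multi-hop task with a known answer."""
    node, path = start, []
    for rel in relations:
        path.append((node, rel))
        node = graph[(node, rel)]
    question = f"Starting from {start}, follow {' -> '.join(relations)}. What do you reach?"
    return Task(question=question, answer=node, path=path)


def rollout(task: Task, lookup: Callable[[str, str], str]) -> Trajectory:
    """Stage 2: a stand-in agent that makes one tool call per hop and logs each step."""
    traj = Trajectory(task=task)
    node = task.path[0][0]
    for _, rel in task.path:
        result = lookup(node, rel)
        traj.steps.append(f"lookup({node!r}, {rel!r}) -> {result!r}")
        node = result
    traj.final_answer = node
    return traj


def verify(traj: Trajectory, max_steps: int = 8) -> Dict[str, bool]:
    """Stage 3: independent checks; these three dimensions are assumptions."""
    return {
        "answer_correct": traj.final_answer == traj.task.answer,
        "grounded": len(traj.steps) == len(traj.task.path),
        "within_budget": len(traj.steps) <= max_steps,
    }


def synthesize(graph: Graph, seeds: List[str], relations: List[str]) -> List[Trajectory]:
    """Run all three stages and keep only trajectories that pass every check."""
    kept = []
    for start in seeds:
        task = formulate_task(graph, start, relations)
        traj = rollout(task, lambda n, r: graph[(n, r)])
        if all(verify(traj).values()):
            kept.append(traj)
    return kept


if __name__ == "__main__":
    toy_graph = {
        ("paper_a", "cites"): "paper_b",
        ("paper_b", "authored_by"): "author_x",
    }
    for t in synthesize(toy_graph, ["paper_a"], ["cites", "authored_by"]):
        print(t.task.question, "=>", t.final_answer)
```

The point of the sketch is the filtering step: a trajectory only enters the training set if it passes every verification dimension, which is one plausible reading of how verification would gate data quality.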
How does the model perform on benchmarks?
S1-DeepResearch-32B, the 32B-parameter instantiation described in the paper, is reported to achieve state-of-the-art performance among open-source models of comparable scale across 20 benchmarks covering five capability dimensions: complex reasoning, instruction following, report generation, file understanding, and skills usage. The authors further state that on several challenging deep research benchmarks the model "approaches the performance of leading proprietary frontier models."
Those are the headline figures given: a 32B model, evaluation on 20 benchmarks, and coverage across five named capability areas. The paper frames its contribution as both a trajectory synthesis technique and a model evaluation. It says the synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning than prior search-oriented datasets do.
Why it matters
Training data has driven recent gains in search-oriented agents, but the paper highlights a gap: datasets that focus on closed-ended QA train information-seeking behavior without adequately covering synthesis, long-horizon planning, file-level understanding, or structured report generation. By constructing agentic trajectories that model task formulation, rollout, and multi-dimensional verification jointly, the paper argues researchers can better train agents intended for real-world, research-style workflows. The reported SOTA across 20 benchmarks and the claim of approaching proprietary performance on hard tasks suggest this combined focus can narrow the gap between open-source and frontier proprietary systems.
What to watch
The arXiv entry includes toggles for code and demos (Links to Code, Demos, Hugging Face, Replicate). The next concrete signals will be whether the authors publish code, model checkpoints, and dataset artifacts on those platforms, and whether independent groups reproduce the 20-benchmark results. Released artifacts and community replication would be the clearest confirmations of the paper's claims.
Authors and identifiers: the paper lists authors Yao Dong, Xinglin Xiao, Liwei Dong, Xinlong Jin, Zhengbo Li, Heng Zhang, Duyun Wang and Nan Xu; it was submitted to arXiv as arXiv:2606.15367 on 13 Jun 2026 and includes an arXiv-issued DOI via DataCite (pending registration).
| Item | S1-DeepResearch-32B | Prior open-source models and datasets | Proprietary frontier models |
|---|---|---|---|
| Model size | 32B | comparable scale (paper comparison) | not specified |
| Benchmarks evaluated | 20 benchmarks (paper) | compared by authors | varies |
| Capability dimensions covered | complex reasoning; instruction following; report generation; file understanding; skills usage | less emphasis on synthesis/planning (per paper) | not quantified in paper |
| Performance claim | state-of-the-art among open-source models of comparable scale across 20 benchmarks | outperformed on those 20 benchmarks (per paper) | approached on several challenging deep research benchmarks |
Written by The Brieftide · Source: arXiv