SGPO: Strategy-Guided Policy Optimization for LLM Reasoning
A new method submitted to arXiv on 23 Jun 2026 distills reusable problem-solving strategies into weaker LLMs instead of imitating instance-level solution trajectories.
TL;DR
- A new method submitted to arXiv on 23 Jun 2026 distills reusable problem-solving strategies into weaker LLMs instead of imitating instance-level solution trajectories.
- The authors report experiments across four mathematical benchmarks and two model families, in which SGPO outperforms several baselines.
- SGPO replaces instance-level trajectory imitation with reusable strategy distillation: it extracts structured strategy descriptions from strong-model responses and uses them to guide learning.
Tianyuan Shi and six co-authors submitted a paper titled "Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning" (arXiv:2606.24064) on 23 Jun 2026. It proposes SGPO, a distillation framework that teaches weaker language models reusable problem-solving strategies rather than copying instance-level solution trajectories. The authors report experiments across four mathematical benchmarks and two model families, in which SGPO outperforms several baselines.
What is SGPO and how does it differ from trajectory imitation?
SGPO replaces instance-level trajectory imitation with reusable strategy distillation: it extracts structured strategy descriptions from strong-model responses and uses them to guide learning. Instead of transferring a specific step-by-step solution for each instance, SGPO builds both autonomous and strategy-guided trajectories per problem so the model can compare behavior with and without strategic guidance.
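The paired-trajectory idea above can be pictured as two prompts for the same problem, one with a strategy description and one without. The sketch below is illustrative only: `PromptPair`, `build_prompt_pair`, the prompt wording, and the example strategy are all assumptions made for this brief, not the paper's templates or code.

```python
from dataclasses import dataclass


@dataclass
class PromptPair:
    """Two prompts for one problem: without and with a strategy description."""
    autonomous: str
    strategy_guided: str


def build_prompt_pair(problem: str, strategy: str) -> PromptPair:
    """Build the unguided and strategy-conditioned prompts for one problem.

    The template wording is a placeholder, not taken from the paper.
    """
    autonomous = f"Problem: {problem}\nSolve step by step."
    guided = (
        f"Strategy: {strategy}\n"
        f"Problem: {problem}\n"
        "Solve step by step, following the strategy."
    )
    return PromptPair(autonomous=autonomous, strategy_guided=guided)


# Hypothetical problem and strategy, used only to show the two prompt shapes.
pair = build_prompt_pair(
    problem="Find the sum of the first 100 positive integers.",
    strategy="Pair terms from opposite ends so every pair has the same sum.",
)
print(pair.autonomous)
print(pair.strategy_guided)
```

Sampling from the model under each prompt would yield one autonomous and one strategy-guided trajectory for the same problem, which is the comparison the method relies on.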
The framework reframes distillation from telling the model what to output on a per-instance basis to teaching how to reason by encoding strategies that generalize across problems. This is aimed at reducing memorization of instance-specific steps and improving generalization to novel problems.
How does SGPO distill strategies into weaker models?
SGPO answers two design questions: how to distill, and when to distill. For how to distill, the method uses a token-level forward-KL objective that selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, and it enforces proximal constraints to maintain stability. For when to distill, SGPO applies adaptive instance-level weighting that strengthens strategy guidance when autonomous exploration is weak and reduces guidance as the model's competence grows.
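To make the "how to distill" part concrete, here is a minimal sketch of a token-level forward KL, assuming the strategy-conditioned distribution plays the teacher role and the unguided distribution plays the student role. That pairing, the function names, and the toy logits are assumptions for illustration; the paper's exact objective and its proximal constraints are not reproduced here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p_guided: List[float], q_unguided: List[float],
               eps: float = 1e-12) -> float:
    """KL(p_guided || q_unguided) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(p_guided, q_unguided)
        if p > 0.0
    )


def token_level_forward_kl(guided_logits: List[List[float]],
                           unguided_logits: List[List[float]]) -> List[float]:
    """Per-position forward KL between guided and unguided next-token distributions."""
    return [
        forward_kl(softmax(g), softmax(u))
        for g, u in zip(guided_logits, unguided_logits)
    ]


# Toy example: vocabulary of 3 tokens, 2 positions (made-up logits).
guided_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
unguided_logits = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

kl_per_token = token_level_forward_kl(guided_logits, unguided_logits)
sequence_loss = sum(kl_per_token) / len(kl_per_token)
print(kl_per_token, sequence_loss)
```

In this toy case the first position contributes nothing because the strategy does not change the distribution there, while the second position carries all of the signal. That is one way to read the "selective" behavior the authors describe.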
Those two mechanisms together aim to provide a selective, stable signal that nudges an unguided policy toward strategy-conditioned behavior without forcing exact trajectory imitation. The paper also states that the forward-KL objective gives an inherently selective distillation signal that outperforms direct trajectory imitation.
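The "when to distill" idea can be sketched as a per-problem weight that shrinks as the model's unguided attempts succeed more often. The linear schedule, the names `guidance_weight` and `combined_loss`, and the additive way of combining the two loss terms are assumptions chosen for illustration; the paper's actual weighting formula may differ.

```python
from typing import List


def guidance_weight(autonomous_rewards: List[float],
                    max_weight: float = 1.0) -> float:
    """Return a per-problem weight that shrinks as autonomous success grows.

    autonomous_rewards holds one reward in [0, 1] per unguided rollout.
    The linear schedule is an illustrative choice, not the paper's formula.
    """
    if not autonomous_rewards:
        return max_weight  # no evidence of competence yet: guide fully
    success_rate = sum(autonomous_rewards) / len(autonomous_rewards)
    success_rate = min(1.0, max(0.0, success_rate))
    return max_weight * (1.0 - success_rate)


def combined_loss(policy_loss: float, distill_loss: float,
                  weight: float) -> float:
    """Add the weighted distillation term to the policy-optimization term."""
    return policy_loss + weight * distill_loss


# Hypothetical rollout outcomes for three problems (1.0 = solved, 0.0 = failed).
hard = guidance_weight([0.0, 0.0, 0.0, 0.0])    # model never solves it alone
medium = guidance_weight([1.0, 0.0, 0.0, 0.0])  # solves it a quarter of the time
easy = guidance_weight([1.0, 1.0, 1.0, 1.0])    # already solves it alone
print(hard, medium, easy)
```

Under this schedule a problem the model already solves unaided receives no strategy guidance, so the distillation term only acts where autonomous exploration is weak.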
How was SGPO evaluated and what were the results?
The authors evaluated SGPO on four mathematical benchmarks and across two model families, comparing it to supervised fine-tuning (SFT), on-policy reinforcement learning, and hybrid-policy baselines. SGPO consistently outperformed those baselines in their experiments. In particular, SGPO improved the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct.
The paper also reports that strategy distillation exhibits complementary scaling with base model capability, meaning the benefit of the method depends on how capable the base model already is. The authors attribute part of the improvement to the forward-KL objective's selective signal.
Why it matters
SGPO reframes distillation as teaching transferable reasoning strategies rather than copying solution traces. If the approach scales as the authors report, it could reduce brittle memorization in weaker models and improve generalization to novel reasoning tasks. Researchers and practitioners trying to boost reasoning in smaller or instruction-tuned models may find SGPO a targeted alternative to brute-force trajectory imitation or heavier reinforcement learning procedures.
This matters for teams that need reliable stepwise reasoning from constrained models, because the method specifically aims to transfer "how to reason" rather than instance-specific answers.
What to watch
Look for code, benchmarks, or replication details linked from arXiv:2606.24064 that quantify how SGPO's benefits vary with model family and benchmark. The next milestones will be independent reproductions across additional tasks and public release of strategy-extraction tooling and training recipes that implement token-level forward-KL with proximal constraints.
Authors: Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang. Submitted 23 Jun 2026 to arXiv as arXiv:2606.24064.
| Item | Role | Reported result |
|---|---|---|
| Supervised Fine-Tuning (SFT) | Baseline used for comparison | Outperformed by SGPO in experiments |
| On-policy Reinforcement Learning | Baseline used for comparison | Outperformed by SGPO in experiments |
| Hybrid-policy baselines | Baseline used for comparison | Outperformed by SGPO in experiments |
| SGPO (Strategy-Guided Policy Optimization) | Proposed method | Improves average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct |
Written by The Brieftide · Source: arXiv