SGPO: Strategy-Guided Policy Optimization for LLM Reasoning

A new method submitted to arXiv on 23 Jun 2026 distills reusable problem-solving strategies into weaker LLMs instead of imitating instance-level solution trajectories.

The Brieftide

TL;DR

  • A new method submitted to arXiv on 23 Jun 2026 distills reusable problem-solving strategies into weaker LLMs instead of imitating instance-level solution trajectories.
  • The authors report experiments across four mathematical benchmarks and two model families in which SGPO outperforms several baselines.
  • SGPO replaces instance-level trajectory imitation with reusable strategy distillation: it extracts structured strategy descriptions from strong-model responses and uses them to guide learning.

Tianyuan Shi and six co-authors submitted the paper Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning (arXiv:2606.24064) on 23 Jun 2026. It proposes SGPO, a distillation framework that teaches weaker language models reusable problem-solving strategies rather than copying instance-level solution trajectories. The authors report experiments across four mathematical benchmarks and two model families in which SGPO outperforms several baselines.

What is SGPO and how does it differ from trajectory imitation?

SGPO replaces instance-level trajectory imitation with reusable strategy distillation: it extracts structured strategy descriptions from strong-model responses and uses them to guide learning. Instead of transferring a specific step-by-step solution for each instance, SGPO builds both autonomous and strategy-guided trajectories per problem so the model can compare behavior with and without strategic guidance.
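The pairing of an autonomous and a strategy-guided attempt for the same problem can be pictured with a minimal Python sketch. Everything named here is hypothetical: the `PromptPair` type, the `build_prompt_pair` helper, and the prompt template are illustrative assumptions, not the paper's actual prompts or data format.

```python
from dataclasses import dataclass


@dataclass
class PromptPair:
    """Two prompts for one problem: without and with a strategy hint."""
    autonomous: str
    strategy_guided: str


# Hypothetical template; the paper's real prompt wording is not given in this brief.
STRATEGY_TEMPLATE = "Strategy hint:\n{strategy}\n\nProblem:\n{problem}"


def build_prompt_pair(problem: str, strategy: str) -> PromptPair:
    """Build the unguided prompt and the strategy-conditioned prompt."""
    return PromptPair(
        autonomous=f"Problem:\n{problem}",
        strategy_guided=STRATEGY_TEMPLATE.format(strategy=strategy, problem=problem),
    )


pair = build_prompt_pair(
    problem="Solve x + 1 = 3.",
    strategy="Isolate the variable by applying the same operation to both sides.",
)
```

Sampling a trajectory from each prompt would give the two behaviors the method compares. Note that the strategy text describes an approach, not the answer to this particular instance.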

The framework reframes distillation from telling the model what to output on a per-instance basis to teaching how to reason by encoding strategies that generalize across problems. This is aimed at reducing memorization of instance-specific steps and improving generalization to novel problems.

How does SGPO distill strategies into weaker models?

SGPO answers two design questions: how to distill, and when to distill. For how to distill, the method uses a token-level forward-KL objective that selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, and it enforces proximal constraints to maintain stability. For when to distill, SGPO applies adaptive instance-level weighting that strengthens strategy guidance when autonomous exploration is weak and reduces guidance as the model's competence grows.
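As a rough illustration of these two ideas, the sketch below computes a token-level forward KL between a strategy-conditioned distribution and an unguided one, then scales it by a per-instance weight. It rests on several assumptions: the guided distribution is treated as the teacher `p` and the unguided policy as the student `q` in KL(p || q), the weight is a simple `1 - success rate` over autonomous rollouts, and the proximal constraints are left out entirely. The function names are hypothetical and the paper's exact loss may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def token_level_forward_kl(guided_logits, unguided_logits):
    """Mean forward KL across token positions of one trajectory."""
    per_token = [forward_kl(g, u) for g, u in zip(guided_logits, unguided_logits)]
    return sum(per_token) / len(per_token)


def adaptive_weight(autonomous_rewards):
    """Assumed weighting: more guidance when unguided rollouts rarely succeed."""
    success_rate = sum(autonomous_rewards) / len(autonomous_rewards)
    return 1.0 - success_rate


def distillation_loss(guided_logits, unguided_logits, autonomous_rewards):
    """Instance-weighted forward-KL term (proximal constraints omitted)."""
    weight = adaptive_weight(autonomous_rewards)
    return weight * token_level_forward_kl(guided_logits, unguided_logits)
```

With this toy weighting, a problem the model already solves on every autonomous rollout contributes no distillation signal, while one it never solves gets the full forward-KL term.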

Those two mechanisms together aim to provide a selective, stable signal that nudges an unguided policy toward strategy-conditioned behavior without forcing exact trajectory imitation. The paper also states that the forward-KL objective gives an inherently selective distillation signal that outperforms direct trajectory imitation.

How was SGPO evaluated and what were the results?

The authors evaluated SGPO on four mathematical benchmarks and across two model families, comparing it to supervised fine-tuning (SFT), on-policy reinforcement learning, and hybrid-policy baselines. SGPO consistently outperformed those baselines in their experiments. In particular, SGPO improved the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct.

The paper also reports that strategy distillation exhibits complementary scaling with base model capability, suggesting that its benefit depends on how capable the base model already is. The authors attribute part of the improvement to the forward-KL objective's selective signal.

Why it matters

SGPO reframes distillation as teaching transferable reasoning strategies rather than copying solution traces. If the approach scales as the authors report, it could reduce brittle memorization in weaker models and improve generalization to novel reasoning tasks. Researchers and practitioners trying to boost reasoning in smaller or instruction-tuned models may find SGPO a targeted alternative to brute-force trajectory imitation or heavier reinforcement learning procedures.

This matters for teams that need reliable stepwise reasoning from constrained models, because the method specifically aims to transfer "how to reason" rather than instance answers.

What to watch

Look for code, benchmarks, or replication details linked from arXiv:2606.24064 that quantify how SGPO's benefits vary with model family and benchmark. The next milestones will be independent reproductions across additional tasks and public release of strategy-extraction tooling and training recipes that implement token-level forward-KL with proximal constraints.

Authors: Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang. Submitted 23 Jun 2026 to arXiv as arXiv:2606.24064.

SGPO versus baselines (summary from paper)

Item | Role | Result
Supervised Fine-Tuning (SFT) | Baseline used for comparison | Outperformed by SGPO in experiments
On-policy Reinforcement Learning | Baseline used for comparison | Outperformed by SGPO in experiments
Hybrid-policy baselines | Baseline used for comparison | Outperformed by SGPO in experiments
SGPO (Strategy-Guided Policy Optimization) | Proposed method | Improves average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct

Written by The Brieftide · Source: arXiv
