SkillOpt: Microsoft boosts GPT-5.5 with trained Markdown
Microsoft and three Chinese universities trained a Markdown instruction file to tune model behavior and report consistent gains on GPT-5.5.
TL;DR
- 01Microsoft and three Chinese universities trained a Markdown instruction file to tune model behavior and report consistent gains on GPT-5.5.
- 02Microsoft and three Chinese universities unveiled SkillOpt, a method that optimizes instruction documents to change large language model behavior without altering model weights.
- 03SkillOpt reframes the instruction or system prompt as a trainable document rather than a fixed text string.
Microsoft and three Chinese universities unveiled SkillOpt, a method that optimizes instruction documents to change large language model behavior without altering model weights. The team applied SkillOpt to GPT-5.5 and demonstrated consistent improvements on multiple instruction-following and agent-style evaluations by training a single Markdown file used as the model's instruction artifact.
SkillOpt reframes the instruction or system prompt as a trainable document rather than a fixed text string. The method treats a structured Markdown file as the object to be optimized: sections, headings, examples and procedural steps are formalized and adjusted through an automated optimization loop until the desired behaviors emerge. Because the system modifies the instruction document rather than model parameters, the same trained Markdown can be deployed against a vanilla GPT-5.5 instance at runtime.
How SkillOpt works
The core idea is to parameterize a human-readable instruction document and search its space for configurations that produce better outputs. The team starts with a base Markdown instruction that contains explicit task definitions, role framing, and sample interactions. An optimizer then proposes edits to document elements and evaluates the resulting outputs from the target model on a collection of labeled or proxy tasks. Feedback from those evaluations guides further edits.
The approach supports both automated edit proposals and human-in-the-loop adjustments. In experiments, the researchers used batch query evaluations of GPT-5.5 to score candidate Markdown variants on instruction-following fidelity, task success and safety constraints. The final artifact is a trained Markdown file that a developer can include in the model context as the instruction layer, producing improved behavior without fine-tuning model weights.
Because SkillOpt operates at the instruction level, it preserves model integrity and reduces the need for expensive retraining cycles. It also keeps prompts in a human-auditable format, which can be reviewed and edited after optimization. The technique does not require altering the underlying model checkpoints, but it does rely on repeated model queries during the optimization phase.
Benchmark results and limitations
The team reports that applying SkillOpt to GPT-5.5 yielded consistent gains across several internal benchmarks used to measure instruction adherence, multi-step problem solving, and agent-oriented tasks such as web retrieval and tool use. Improvements were most pronounced on tasks that benefit from clearer task decomposition and role framing within the instruction document.
Limitations remain. The optimization process requires many model queries, which can be costly for large models. Gains may also depend on the quality and representativeness of the evaluation tasks used during optimization. The trained Markdown is task specific: a document tuned for one class of tasks may not transfer well to unrelated tasks without further optimization. There are also open questions about whether optimizing for proxy metrics can inadvertently encourage surface-level fixes that do not generalize to real-world user queries.
The research partners named in the release include Microsoft and researchers from Tsinghua University, Peking University and Zhejiang University. The team released examples of trained Markdown files and described workflows for both automated and human-guided refinement, enabling other developers to evaluate SkillOpt-style instruction optimization on their own tasks.
Why it matters
SkillOpt signals a shift toward treating instructions as first-class, trainable artifacts that can be improved without touching model weights, lowering the entry cost to customize large models. That matters for teams that need safer or more reliable behavior quickly, and for auditors who want readable artifacts to inspect changes in model behavior. It also raises practical trade-offs: optimization can be cheaper than retraining but still requires substantial inference budget and careful validation to avoid overfitting to proxy metrics.
| Item | |||
|---|---|---|---|
| Instruction-following score | Baseline | +6 to +12 points (relative improvement reported) | |
| Agent task success rate | Baseline | +5 to +15 percentage points (task dependent) | |
| Safety violations | Observed at baseline rate | Reduced in many test cases, not eliminated | |
| Transfer to unrelated tasks | Limited | Requires re-optimization or tuning |
Primary source
The Decoder
the-decoder.comThe Brieftide Daily · 06:00
Briefs like this one, in your inbox every morning.
Read next
- OpenAI acquires Ona to push Codex toward autonomous codingJun 12 · 3 min read
- OpenAI Academy launches 3 courses to apply AI at workJun 12 · 4 min read
- Agentic AI token costs and per-workflow pricing for agentsJun 8 · 4 min read
- Perplexity launches Search as Code: models write Python pipelinesJun 7 · 4 min read