NebulaExp-8B post-training pipeline: full-scale ablation
A transparent, ablation-driven post-training recipe for Qwen3-8B-base using 3.84M SFT samples and a 200K RL candidate pool.
TL;DR
- A transparent, ablation-driven post-training recipe for Qwen3-8B-base using 3.84M SFT samples and a 200K RL candidate pool.
- NebulaExp-8B presents a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, submitted to arXiv on 25 Jun 2026.
- NebulaExp is a post-training pipeline for 8B-scale models that splits into two branches, a general instruct model and a complex reasoning-specialized model.
NebulaExp-8B presents a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, submitted to arXiv on 25 Jun 2026. The authors publish a raw corpus of 3.84M multi-source supervised fine-tuning samples and a 200K verifiable reinforcement learning candidate pool, and describe a complete data-processing stack and training recipe.
What is NebulaExp-8B's pipeline?
NebulaExp is a post-training pipeline for 8B-scale models that splits into two branches, a general instruct model and a complex reasoning-specialized model. It starts from Qwen3-8B-base, curates 3.84M multi-source SFT samples and a 200K verifiable RL candidate pool, and applies response distillation, multi-dimensional cross-verification filtering, fine-grained difficulty grading, task classification and diversity-aware sampling.
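To make the filter, grade and sample stages concrete, here is a minimal Python sketch. It is not the authors' implementation: the record fields (`task`, `pass_rate`, `verified`), the pass-rate thresholds and the function names are all hypothetical, and grading difficulty by pass rate is an assumption made only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical record layout (an assumption, not from the paper): each
# candidate carries a task label, a pass_rate in [0, 1], and a boolean
# `verified` flag standing in for cross-verification filtering.

def grade_difficulty(pass_rate: float) -> str:
    """Bucket a sample by pass rate; the thresholds are illustrative only."""
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.3:
        return "medium"
    return "hard"

def diversity_aware_sample(records, per_bucket, seed=0):
    """Draw up to `per_bucket` records from every (task, difficulty) bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        if not rec["verified"]:  # drop samples that failed verification
            continue
        key = (rec["task"], grade_difficulty(rec["pass_rate"]))
        buckets[key].append(rec)
    sampled = []
    for key in sorted(buckets):  # sorted for a deterministic bucket order
        pool = buckets[key]
        sampled.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return sampled

records = [
    {"id": 1, "task": "math", "pass_rate": 0.90, "verified": True},
    {"id": 2, "task": "math", "pass_rate": 0.95, "verified": True},
    {"id": 3, "task": "math", "pass_rate": 0.50, "verified": True},
    {"id": 4, "task": "code", "pass_rate": 0.10, "verified": True},
    {"id": 5, "task": "code", "pass_rate": 0.20, "verified": False},
]
subset = diversity_aware_sample(records, per_bucket=1)
```

Capping each (task, difficulty) bucket keeps an over-represented category, here easy math, from dominating the final mix, which is the general idea behind diversity-aware sampling.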
The authors frame the work as an end-to-end, ablation-driven stack: they run full-scale experiments to measure how stages and dataset choices affect instruction adherence, mathematical reasoning, code generation and general knowledge. The paper is 29 pages and includes 8 figures documenting these experiments.
How much did the pipeline change benchmark scores?
NebulaExp reports gains in both branches. In the Instruct branch, NebulaExp-Ins-SFT raises the average benchmark score from the 55.01 baseline (Qwen3-8B-nothink) to 60.99, and GRPO reinforcement learning lifts that average further to 61.85. In the Reasoning branch, medium-difficulty GRPO RL improves the average reasoning score from 73.88 to 75.17.
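For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at its core: several responses are sampled per prompt, and each reward is normalized against the group's mean and standard deviation. This is the general GRPO formulation, not the paper's configuration, which this brief does not describe. The function name, the `eps` constant and the choice of population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds the verifier scores for several responses to ONE prompt.
    `eps` (an assumed constant) avoids division by zero for uniform groups.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# A prompt the model solves about half the time gives a clear signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A prompt that is always solved (or never solved) gives zero advantage
# for every response, so it contributes no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The uniform case suggests one plausible reason to favor medium-difficulty prompts in RL: prompts that are too easy or too hard tend to produce identical rewards within a group. That reading is an inference, not a claim taken from the paper.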
The paper also investigates verifier-free alternatives. A single-teacher OPD approach using only 4K instruction-following samples outperforms the RL baseline by 3.26 points on IFEval and delivers a +4.43 average overall gain. A multi-teacher OPD (MOPD) that fuses four domain-specialist teachers with 10K samples lifts average performance by 4.18 over the base model.
Those numbers map specific interventions to measurable changes: supervised fine-tuning yields a sizeable jump, RL adds a smaller incremental improvement, and small, targeted OPD pools can outperform RL on certain metrics.
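The per-stage deltas follow directly from the averages quoted above. A short Python check (the dictionary keys are labels chosen here for illustration, not names from the paper):

```python
# Average scores as quoted in this brief.
scores = {
    "instruct_baseline": 55.01,    # Qwen3-8B-nothink
    "instruct_sft": 60.99,         # NebulaExp-Ins-SFT
    "instruct_sft_grpo": 61.85,    # after GRPO RL
    "reasoning_before_rl": 73.88,
    "reasoning_after_rl": 75.17,   # after medium-difficulty GRPO RL
}

# Round to two decimals to hide floating-point noise.
sft_gain = round(scores["instruct_sft"] - scores["instruct_baseline"], 2)
instruct_rl_gain = round(scores["instruct_sft_grpo"] - scores["instruct_sft"], 2)
reasoning_rl_gain = round(scores["reasoning_after_rl"] - scores["reasoning_before_rl"], 2)

print(sft_gain, instruct_rl_gain, reasoning_rl_gain)
```

This gives 5.98 points from SFT, 0.86 from GRPO on the Instruct branch and 1.29 from GRPO on the Reasoning branch, so SFT accounts for most of the Instruct-branch improvement.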
Why it matters
NebulaExp makes the dataset construction, filtering rules and training recipes explicit, addressing the paper's stated problem that prior work often withholds those details and thus hinders community reproducibility and lightweight model optimization. The documented gains show that carefully curated SFT corpora, combined with small verifier-free OPD sets (single-teacher or multi-teacher), can move scores for 8B models, not only for much larger ones.
This shifts the conversation from opaque, monolithic RLHF pipelines to reproducible, ablation-tested stacks that researchers and practitioners can inspect and replicate for 8B models.
What to watch
Look for community reproductions of the 3.84M SFT corpus and the 200K RL candidate pool, and for independent benchmarks that validate OPD and MOPD gains on IFEval and the paper's reported average metrics. The next confirmatory milestone will be separate implementations showing similar score deltas for NebulaExp-Ins-SFT, GRPO and OPD methods.
| Item | Starting score | After SFT | After GRPO RL | Reported gain |
|---|---|---|---|---|
| Instruct average benchmark score | 55.01 (Qwen3-8B-nothink) | 60.99 | 61.85 | — |
| Reasoning average score | 73.88 | — | 75.17 (medium-difficulty) | — |
| IFEval improvement (OPD single-teacher, 4K) | — | — | — | +3.26 (vs RL baseline) |
| Average overall gain (OPD single-teacher, 4K) | — | — | — | +4.43 |
| Average overall gain (MOPD, four teachers, 10K) | — | — | — | +4.18 (over base model) |
Written by The Brieftide · Source: arXiv