AI Infrastructure · 4 min read

NebulaExp-8B post-training pipeline: full-scale ablation

A transparent, ablation-driven post-training recipe for Qwen3-8B-base using 3.84M SFT samples and a 200K RL candidate pool.

The Brieftide

TL;DR

  • A transparent, ablation-driven post-training recipe for Qwen3-8B-base using 3.84M SFT samples and a 200K RL candidate pool.
  • NebulaExp-8B presents a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, submitted to arXiv on 25 Jun 2026.
  • NebulaExp is a post-training pipeline for 8B-scale models that splits into two branches: a general instruct model and a reasoning-specialized model.

NebulaExp-8B presents a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, submitted to arXiv on 25 Jun 2026. The authors publish a raw corpus of 3.84M multi-source supervised fine-tuning samples and a 200K verifiable reinforcement learning candidate pool, and describe a complete data-processing stack and training recipe.

What is NebulaExp-8B's pipeline?

NebulaExp is a post-training pipeline for 8B-scale models that splits into two branches: a general instruct model and a model specialized for complex reasoning. It starts from Qwen3-8B-base, assembles 3.84M multi-source SFT samples and a 200K verifiable RL candidate pool, and then applies response distillation, multi-dimensional cross-verification filtering, fine-grained difficulty grading, task classification, and diversity-aware sampling.
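
The paper's actual grading and sampling rules are not reproduced in this brief. As a rough illustration only, the sketch below shows one way difficulty grading and diversity-aware sampling could fit together: each sample is graded from a pass rate, grouped by task and difficulty, and each group is capped. The pass_rate field, the 0.8 and 0.3 thresholds, the bucket names and the per-bucket cap are all hypothetical, not values from NebulaExp.

```python
import random
from collections import defaultdict


def grade_difficulty(pass_rate: float) -> str:
    """Hypothetical three-way grade; the thresholds are assumptions."""
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.3:
        return "medium"
    return "hard"


def diversity_aware_sample(samples, per_bucket, seed=0):
    """Cap every (task, difficulty) bucket so no single bucket dominates."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for sample in samples:
        key = (sample["task"], grade_difficulty(sample["pass_rate"]))
        buckets[key].append(sample)

    selected = []
    for key in sorted(buckets):  # sorted for a reproducible order
        items = list(buckets[key])
        rng.shuffle(items)
        selected.extend(items[:per_bucket])
    return selected


# Toy pool: ten easy math samples and two hard code samples (made-up data).
pool = [{"task": "math", "pass_rate": 0.9} for _ in range(10)]
pool += [{"task": "code", "pass_rate": 0.1} for _ in range(2)]
subset = diversity_aware_sample(pool, per_bucket=3)
print(len(subset))  # 3 math samples plus 2 code samples
```

Capping each bucket is only one simple way to trade volume for coverage; the paper may use a different criterion.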

The authors frame the work as an end-to-end, ablation-driven stack: they run full-scale experiments to measure how stages and dataset choices affect instruction adherence, mathematical reasoning, code generation and general knowledge. The paper is 29 pages and includes 8 figures documenting these experiments.

How much did the pipeline change benchmark scores?

The paper reports gains in both branches. In the Instruct branch, NebulaExp-Ins-SFT raises the average benchmark score from 55.01 (the Qwen3-8B-nothink baseline) to 60.99, and GRPO reinforcement learning lifts it further to 61.85. In the Reasoning branch, medium-difficulty GRPO RL improves the average reasoning score from 73.88 to 75.17.
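
GRPO is commonly described as scoring each sampled response relative to the other responses drawn for the same prompt. The minimal sketch below shows that group-relative normalization, assuming binary verifier rewards. The function name, the epsilon and the use of population standard deviation are illustrative choices, and the paper's exact GRPO variant is not described in this brief.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses, verifier reward 1.0 = pass, 0.0 = fail.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed)     # the passing response gets a positive advantage
print(all_pass)  # every advantage is zero, so there is no learning signal
```

Under this formulation, a prompt whose responses all pass or all fail yields zero advantages. That is one plausible reason to favor medium-difficulty prompts, though the brief does not state the authors' rationale.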

The paper also investigates verifier-free alternatives. A single-teacher OPD approach using only 4K instruction-following samples outperforms the RL baseline by 3.26 points on IFEval and delivers a +4.43 average overall gain. A multi-teacher OPD (MOPD) that fuses four domain-specialist teachers with 10K samples lifts average performance by 4.18 over the base model.

Those numbers map specific interventions to measurable changes: supervised fine-tuning yields the largest jump (+5.98 on the Instruct average), RL adds a smaller increment (+0.86 on Instruct, +1.29 on Reasoning), and small, targeted OPD pools can outperform RL on certain metrics such as IFEval.
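
The per-stage deltas implied by the reported averages can be recomputed directly. The snippet below is plain arithmetic on the figures quoted in this brief; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# Averages quoted in this brief (keys are hypothetical labels).
instruct = {"baseline": 55.01, "sft": 60.99, "grpo": 61.85}
reasoning = {"before_rl": 73.88, "after_rl": 75.17}

gains = {
    "instruct_sft": round(instruct["sft"] - instruct["baseline"], 2),
    "instruct_rl": round(instruct["grpo"] - instruct["sft"], 2),
    "reasoning_rl": round(reasoning["after_rl"] - reasoning["before_rl"], 2),
}

for stage, delta in gains.items():
    print(f"{stage}: +{delta:.2f}")
# instruct_sft: +5.98
# instruct_rl: +0.86
# reasoning_rl: +1.29
```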

Why it matters

NebulaExp makes the dataset construction, filtering rules and training recipes explicit, addressing the paper's stated problem that prior work often withholds those details and thus hinders community reproducibility and lightweight model optimization. The documented gains show that carefully curated SFT corpora, combined with small verifier-free OPD or multi-teacher OPD sets, can move scores at 8B scale and not only in much larger models.

This shifts the conversation from opaque, monolithic RLHF pipelines to reproducible, ablation-tested stacks that researchers and practitioners can inspect and replicate for 8B models.

What to watch

Look for community reproductions of the 3.84M SFT corpus and the 200K RL candidate pool, and for independent benchmarks that validate OPD and MOPD gains on IFEval and the paper's reported average metrics. The next confirmatory milestone will be separate implementations showing similar score deltas for NebulaExp-Ins-SFT, GRPO and OPD methods.

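Reported scores and gains from NebulaExp-8B

  • Instruct average benchmark score: 55.01 (Qwen3-8B-nothink baseline), 60.99 (NebulaExp-Ins-SFT), 61.85 (after GRPO RL)
  • Reasoning average score: 73.88 (before GRPO RL), 75.17 (after medium-difficulty GRPO RL)
  • IFEval improvement (single-teacher OPD, 4K samples): 3.26 points over the RL baseline
  • Average overall gain: 4.43 (single-teacher OPD, 4K samples), 4.18 (MOPD, 10K samples)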

Written by The Brieftide · Source: arXiv
