Large Language Models: Small Initialization Improves Reasoning

An arXiv paper (submitted 16 Jun 2026) shows smaller parameter initialization scales boost pretraining, especially on reasoning tasks.

The Brieftide

TL;DR

  • An arXiv paper (submitted 16 Jun 2026) shows smaller parameter initialization scales boost pretraining, especially on reasoning tasks.
  • Small initialization of parameters improves pretraining for large language models.
  • The paper finds that smaller parameter initialization scales consistently improve pretraining outcomes and strengthen reasoning abilities in LLMs.

Small initialization of parameters improves pretraining for large language models. That is the central finding of "Small Initialization Matters for Large Language Models" (arXiv:2606.17945), submitted to arXiv on 16 Jun 2026 by Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang and Zhi-Qin John Xu. The authors report that reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks.

What did the authors find?

The paper finds that smaller parameter initialization scales consistently improve pretraining outcomes and strengthen reasoning abilities in LLMs. The authors report that the improvements concentrate on "non-trivial, context-constrained predictions" rather than spreading uniformly across all tokens. They identify a "critical initialization" that balances reasoning and training, and they propose a simple γ-initialization rule that makes initialization an explicit knob, recommending small initialization by default. The submission runs 26 pages and includes 8 figures.

Beyond the headline, the paper documents two empirical settings commonly used in practice that can limit the advantage of small initialization, and shows that relaxing those settings restores favorable scaling for small initialization. The authors emphasize that the largest gains appear on reasoning-demanding tasks.

How does small initialization change training dynamics?

Small initialization drives a distinct developmental trajectory in parameter dynamics: parameters first condense into low-complexity structures and later expand into richer representations. The authors present this condense-then-expand progression as a concrete instance of the idea that "compression is intelligence." Token-level analyses in the paper tie the trajectory to improved predictions for tokens that require contextual reasoning, rather than to uniform accuracy gains across every token.
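To make "condensing into low-complexity structures" more tangible, here is a minimal sketch of one possible diagnostic: the mean absolute cosine similarity between the rows of a weight matrix. This metric is an assumption chosen for illustration, not necessarily what the authors measure, and the helper names (`mean_abs_cosine`, `scaled`) and the toy matrices are hypothetical. A value near 1 means the rows point along a few shared directions; a value near 0 means they are spread out.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_abs_cosine(rows):
    """Average |cosine| over all distinct row pairs (illustrative metric)."""
    total, count = 0.0, 0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            total += abs(cosine(rows[i], rows[j]))
            count += 1
    return total / count


def scaled(vec, factor):
    """Return a copy of vec multiplied by a scalar."""
    return [factor * x for x in vec]


rng = random.Random(0)
dim, n_rows = 64, 16

# Toy "condensed" matrix: every row is a signed multiple of one direction.
direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
condensed = [
    scaled(direction, rng.choice((-1.0, 1.0)) * rng.uniform(0.5, 2.0))
    for _ in range(n_rows)
]

# Toy "spread" matrix: independent Gaussian rows.
spread = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]

if __name__ == "__main__":
    print("condensed:", round(mean_abs_cosine(condensed), 3))
    print("spread:   ", round(mean_abs_cosine(spread), 3))
```

Tracking a statistic like this over training checkpoints would be one way to look for an early rise followed by a later decline, under the assumption that it captures the kind of structure the paper describes.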

The paper also locates a "critical initialization" regime, which the authors describe as a balance point between optimizing for reasoning capability and maintaining efficient training. They argue for exposing initialization as a tunable parameter through their γ-initialization rule so practitioners can choose smaller initial scales as a low-cost intervention to boost reasoning across model scales.
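This brief does not reproduce the formula behind the γ-initialization rule, so the sketch below only illustrates what exposing initialization as a single knob could look like. It assumes, hypothetically, that weights are drawn from a zero-mean Gaussian with standard deviation `fan_in ** (-gamma)`. Under that assumption γ = 0.5 matches the familiar 1/sqrt(fan_in) scaling and larger γ gives a smaller initialization. The function names are illustrative and are not taken from the paper.

```python
import math
import random


def gamma_init_std(fan_in: int, gamma: float) -> float:
    """Standard deviation under the ASSUMED rule std = fan_in ** (-gamma)."""
    if fan_in <= 0:
        raise ValueError("fan_in must be positive")
    return fan_in ** (-gamma)


def init_weight_matrix(fan_out: int, fan_in: int, gamma: float, seed: int = 0):
    """Sample a fan_out x fan_in matrix from N(0, std^2) with the assumed std."""
    rng = random.Random(seed)
    std = gamma_init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def empirical_std(matrix) -> float:
    """Population standard deviation of all entries, as a sanity check."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((w - mean) ** 2 for w in values) / len(values))


if __name__ == "__main__":
    # Larger gamma means a smaller initial scale for the same layer width.
    for gamma in (0.5, 0.8, 1.0):
        weights = init_weight_matrix(fan_out=64, fan_in=256, gamma=gamma)
        print(gamma, gamma_init_std(256, gamma), empirical_std(weights))
```

The appeal of a single exponent is that one value can be swept across layers and model widths. Which values count as "small" or "critical" is something only the paper itself can settle.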

Why does this matter?

Smaller initialization is an almost cost-free change to model setup that the authors tie directly to improved reasoning behavior in LLMs. If small initialization indeed concentrates gains on context-dependent, harder predictions, then model builders can improve reasoning performance without changing scale, data, or architecture. The paper frames initialization not as an incidental engineering choice but as a gene-like determinant of training and model capacity, shifting how teams might prioritize hyperparameter search during pretraining.

What are the limits and caveats?

The authors note that two widely used empirical settings can restrain the advantage of small initialization, and that those constraints must be relaxed to recover the benefits. The paper does not claim universal gains across every task or token; instead the improvements are strongest on reasoning-demanding tasks and on tokens requiring non-trivial contextual inference.

What to watch

Look for follow-up experiments that publish the quantitative breakdowns across tasks and the concrete hyperparameter values that define the paper's "small" and "critical" initializations. The arXiv submission provides a theory-meets-empirics framing and a recommended γ-initialization rule; replication on different architectures and datasets would show how broadly the condensation-then-expansion trajectory and token-level gains hold.

Technical reference: arXiv:2606.17945, submitted 16 Jun 2026; authors Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu; 26 pages, 8 figures.

Initialization scenarios and training outcomes (qualitative): small initialization drives parameters to condense into low-complexity structures early, later expanding into richer representations; it yields the largest gains on reasoning-demanding tasks and improves non-trivial, context-constrained token predictions. Qualitative scenarios drawn from the paper's findings about initialization scale and dynamics.

Written by The Brieftide · Source: arXiv
