SWave analysis: complex-valued recurrent LM, 169.26M params
An arXiv paper identifies a failure mode called "cos-domination collapse" in SWave and traces the architecture changes that led to its best-step validation perplexity of 22.0.
TL;DR
- An arXiv paper identifies a failure mode called "cos-domination collapse" in SWave and traces the architecture changes that led to its best-step validation perplexity.
- The authors report a best-step validation perplexity of 22.0 at step 89,861 during a 200,000-step training run.
- SWave is a 169.26M-parameter, complex-valued recurrent language model with hidden dimension D=384, depth L=16, and context length T=2048, trained on FineWeb-Edu using two H100 accelerators.
A new arXiv paper analyzes SWave, a complex-valued recurrent language model with 169.26M parameters (D=384, L=16, T=2048), trained on FineWeb-Edu using 2xH100 NVL. It documents why the original design failed and how the architecture evolved to support stable long-run training. The authors report a best-step validation perplexity of 22.0 at step 89,861 during a 200,000-step training run.
What did the authors build and measure?
SWave is a 169.26M-parameter, complex-valued recurrent language model (hidden dimension D=384, depth L=16, context length T=2048) trained on FineWeb-Edu using two H100 accelerators. The paper presents three development phases of the model. Two components, ComplexNorm and the Wave Propagation Scan, are retained across all three phases. The best-step validation perplexity is 22.0, recorded at step 89,861 of a 200,000-step training run.
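To put the headline number in terms of training loss, here is a minimal sketch of the standard perplexity definition. It assumes the paper reports perplexity as the exponential of the mean per-token negative log-likelihood in nats, which this brief does not confirm; the function name is illustrative.

```python
import math

def perplexity(token_nll_nats):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nll_nats) / len(token_nll_nats))

# Under the assumption above, a perplexity of 22.0 corresponds to a mean
# loss of ln(22) nats per token, roughly 3.09.
loss_at_best_step = math.log(22.0)
print(round(loss_at_best_step, 3))
```

Read this way, the best step corresponds to a validation loss of about 3.09 nats per token.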
The authors also detail a training-stack improvement: a parallel scan whose backward pass is computed in log space for numerical stability. They report stable 200,000-step training after the architectural changes and include a formal characterization of the failure mode they name "cos-domination collapse".
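The brief does not reproduce the paper's scan kernel, so the following is only a sketch of why log space helps, under an assumed recurrence h_t = a_t * h_{t-1} + b_t with complex gates a_t. The gradient of h_t with respect to b_s is the product of the gates between s and t. Forming that product as exp(L_t - L_s), where L is a cumulative sum of logs, avoids multiplying out long products that underflow. The function names and the dense O(T^2) layout are illustrative choices, not the authors' implementation.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference recurrence h_t = a_t * h_{t-1} + b_t, starting from zero."""
    h = np.zeros(len(a), dtype=complex)
    state = 0j
    for t in range(len(a)):
        state = a[t] * state + b[t]
        h[t] = state
    return h

def log_space_decay(a):
    """Matrix D with D[t, s] = prod_{r=s+1..t} a_r for s <= t, else 0.

    Each entry is formed as exp(L_t - L_s), where L is the cumulative sum of
    the complex logs of the gates, so no long product is multiplied directly.
    D[t, s] is also the derivative of h_t with respect to b_s.
    """
    log_a = np.log(np.abs(a)) + 1j * np.angle(a)    # complex log of each gate
    L = np.cumsum(log_a)
    diff = L[:, None] - L[None, :]                  # L_t - L_s
    mask = np.tril(np.ones((len(a), len(a)), dtype=bool))
    # Zero the exponent above the diagonal first so exp cannot overflow there.
    return np.where(mask, np.exp(np.where(mask, diff, 0)), 0)

# Small check: the dense log-space form reproduces the sequential scan.
rng = np.random.default_rng(0)
a = 0.9 * np.exp(1j * rng.uniform(-np.pi, np.pi, 8))
b = rng.normal(size=8) + 1j * rng.normal(size=8)
print(np.allclose(log_space_decay(a) @ b, sequential_scan(a, b)))
```

A production kernel would use an associative scan rather than a dense matrix. The point here is only that differences of cumulative logs stay representable even when the cumulative products themselves underflow to zero.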
How did SWave evolve and what failed in the original design?
The original Resonance Head structurally permitted an imaginary-channel collapse, which the authors call "cos-domination collapse", and that degenerate minimum drove the design changes. The Resonance Head was replaced by an untied head within the Phase-Associative Memory (PAM) architecture, which includes independent real and imaginary embedding tables; that change resolved the degenerate minimum and enabled stable long-run training.
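One way to watch for this kind of collapse during training is to track how much of the hidden-state energy sits in the imaginary channel. The sketch below is a hypothetical diagnostic, not the paper's formal characterization. It assumes that "cos-domination" means the real (cosine) component ends up carrying nearly all of the signal, and the function name is made up for illustration.

```python
import numpy as np

def imag_energy_fraction(h):
    """Share of total squared magnitude carried by the imaginary part of h."""
    total = np.sum(np.abs(h) ** 2)
    if total == 0:
        return 0.0
    return float(np.sum(h.imag ** 2) / total)

# States with balanced real and imaginary parts give 0.5; a state whose
# imaginary channel has vanished gives 0.0.
balanced = np.array([1 + 1j, -2 + 2j, 0.5 - 0.5j])
collapsed = np.array([1 + 0j, -2 + 0j, 0.5 + 0j])
print(imag_energy_fraction(balanced), imag_energy_fraction(collapsed))
```

A fraction that drifts toward zero over training would suggest that the model is no longer using its imaginary channel.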
Other specifics from the evolution: ComplexNorm and the Wave Propagation Scan were retained through all three phases and marked as load-bearing. ProtectGatedScan was reframed from a learned behavior into a structural prior. The authors found that four multi-scale retention concepts produced no measurable improvement and classified them as non-load-bearing. The ComplexGatedUnit was ultimately superseded by a real-valued squared-ReLU channel mixer with fewer parameters. The paper also reports that auxiliary training objectives provided no benefit once the structural constraints were fixed.
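For readers unfamiliar with the replacement block, a squared-ReLU channel mixer is a feed-forward layer that projects up, applies relu(x)**2, and projects back down. The sketch below shows that general pattern only. The weight shapes, the absence of biases, and the function name are assumptions, since the brief gives no details of the paper's layer.

```python
import numpy as np

def squared_relu_mixer(x, w_in, w_out):
    """Real-valued channel mixer: project up, apply relu(.)**2, project down."""
    hidden = np.maximum(x @ w_in, 0.0) ** 2
    return hidden @ w_out

# Tiny example with hypothetical sizes: 2 input channels, 2 hidden, 1 output.
x = np.array([[3.0, -2.0]])
w_in = np.eye(2)
w_out = np.ones((2, 1))
y = squared_relu_mixer(x, w_in, w_out)   # relu([3, -2]) = [3, 0], squared = [9, 0]
print(y)
```

Because it operates on real values only, a layer like this needs a single pair of weight matrices. That fits the brief's note that the replacement has fewer parameters, though the actual sizes are not given.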
Why does the paper matter?
The paper supplies both a failure analysis and an engineering trace of fixes: it formally defines a collapse mode for complex-valued recurrent units, shows which structural choices were essential to avoid that mode, and gives practical stability measures such as a log-space backward pass for scan operations. Those deliverables go beyond a single-model result; the authors list six transferable engineering principles for complex-valued recurrent training and a plan-to-code traceability methodology intended to catch structural divergences that standard test suites miss.
Because the work pairs a concrete failure mode with reproducible fixes and a long-run training trace (200,000 steps, best-step PPL 22.0 at step 89,861), it provides a clear set of signals for researchers trying complex-valued recurrent architectures or analyzing recurrent stability.
What to watch
Attempted reproduction of the paper's stable 200,000-step training curve and the reported best-step PPL 22.0 at step 89,861 will be the clearest test of the paper's claims. Adoption of the paper's six engineering principles and the plan-to-code traceability checks in other complex-valued recurrent projects will indicate whether the fixes generalize beyond SWave.
Key figures drawn from the paper: model size 169.26M parameters, D=384, L=16, T=2048, training on FineWeb-Edu with 2xH100 NVL, stable 200,000-step training, best-step PPL 22.0 at step 89,861, and the named failure mode "cos-domination collapse".
Written by The Brieftide · Source: arXiv