Spike No More: Stabilizing the Pre-training of Large Language Models
Analyzes loss spikes during LLM pre-training — the sudden, training-destabilizing jumps that have plagued frontier-scale training runs — through spectral analysis of the Jacobian of each Transformer sub-layer. Identifies two architectural conditions that suffice to prevent spikes: small sub-layers (bounded sub-layer output magnitude) and large shortcut (sufficient residual-stream passthrough).
Empirically validated across a range of pre-training setups: models satisfying both conditions train stably without spikes, while violating either condition reliably predicts spike occurrence. Complements optimization-side stabilizers such as the gradient-clipping method AdaGC by targeting the architectural side of stability instead. COLM 2025. By Sho Takase, Shun Kiyono, Sosuke Kobayashi (SB Intuitions), and Jun Suzuki (Tohoku).
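The two conditions can be illustrated with a toy pre-LN residual update y = x + f(LN(x)): scaling the residual stream up ("large shortcut") while shrinking sub-layer weights ("small sub-layers") keeps each sub-layer's contribution small relative to the shortcut. This is a minimal sketch, not the paper's exact prescription; the hidden size, depth, embedding scale of √d, and the depth-aware init standard deviation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512          # hidden size (hypothetical)
n_layers = 24    # depth (hypothetical)

def layer_norm(x):
    # Simplified LayerNorm without learnable gain/bias.
    return (x - x.mean()) / (x.std() + 1e-6)

# "Large shortcut": scale the embedding so the residual stream dominates
# (scaling by sqrt(d) is an assumed choice for this sketch).
x = rng.normal(size=d) * np.sqrt(d)

# "Small sub-layers": depth-aware small init keeps sub-layer outputs bounded
# (the 1/sqrt(2 * n_layers * d) std is an illustrative assumption).
W = rng.normal(scale=1.0 / np.sqrt(2 * n_layers * d), size=(d, d))
sub_out = W @ layer_norm(x)

# The ratio of sub-layer output norm to residual-stream norm stays small,
# so each residual update y = x + sub_out perturbs the stream only slightly.
ratio = np.linalg.norm(sub_out) / np.linalg.norm(x)
print(f"||f(LN(x))|| / ||x|| = {ratio:.4f}")
```

With these scales the ratio comes out well below 1, which is the qualitative regime the two conditions aim for; removing either the embedding scaling or the small init pushes the ratio up and, per the paper's analysis, toward spike-prone training.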
Paper
arXiv: 2312.16903
Venue: COLM 2025