Analyzes loss spikes during LLM pre-training, the sudden jumps that have destabilized frontier-scale training runs, through spectral analysis of the Jacobian of each Transformer sub-layer. Identifies two architectural conditions that together suffice to prevent spikes: small sub-layers (bounded sub-layer output magnitude) and a large shortcut (sufficient residual-stream passthrough).
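Roughly, for a residual sub-layer y = x + F(x) the Jacobian is I + J_F, so keeping each sub-layer's output (and hence ||J_F||) small while the shortcut x stays large keeps the spectral norm near 1 and bounds the gradients flowing through the stack. A minimal PyTorch sketch of the two conditions, assuming the common recipe of small initialization with depth-scaled residual projections plus √d-scaled embeddings; the scaling constants and helper names are illustrative assumptions, not the authors' reference code:

```python
import math
import torch
import torch.nn as nn

def small_init_(linear: nn.Linear, d_model: int, n_layers: int,
                residual_out: bool = False) -> None:
    """Condition 1 (small sub-layers): small-std init keeps each sub-layer's
    output, and hence its Jacobian norm, small. Projections that write into
    the residual stream are additionally shrunk by 1/sqrt(2N) with depth N.
    The std below follows a common small-init recipe (an assumption)."""
    std = math.sqrt(2.0 / (5.0 * d_model))
    if residual_out:
        std /= math.sqrt(2.0 * n_layers)
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

class ScaledEmbedding(nn.Module):
    """Condition 2 (large shortcut): multiplying embeddings by sqrt(d) makes
    the residual stream large relative to the (small) sub-layer outputs."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        nn.init.normal_(self.embed.weight, mean=0.0,
                        std=math.sqrt(2.0 / (5.0 * d_model)))
        self.scale = math.sqrt(d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embed(tokens) * self.scale

# Usage: flag the projections that add into the residual stream
# (the attention out-projection and the second FFN layer).
d_model, n_layers = 1024, 24
ffn_out = nn.Linear(4 * d_model, d_model)
small_init_(ffn_out, d_model, n_layers, residual_out=True)
```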

Empirically validated across a range of pre-training setups: models that satisfy both conditions train stably without spikes, while violating either condition is predictive of spikes. Complementary to optimizer-side stabilizers such as the adaptive gradient-clipping method AdaGC, targeting the architectural side of stability rather than the optimization side. By Sho Takase, Shun Kiyono, Sosuke Kobayashi (SB Intuitions), and Jun Suzuki (Tohoku University).

Paper

arXiv: 2312.16903 (Spike No More: Stabilizing the Pre-training of Large Language Models)

Venue: COLM 2025

foundational, training-stability