Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
Proposes Warmup-Stable-Only (WSO): a pretraining learning-rate schedule that holds the learning rate constant after warmup, with no decay phase. Although WSO yields slightly worse pretraining loss than cosine-decay or warmup-stable-decay (WSD) schedules, models pretrained with WSO consistently outperform their decay-based counterparts after supervised fine-tuning at the 1B and 8B scales. Loss-landscape analysis shows that decay schedules converge to sharper minima, while WSO preserves flatter minima that generalize better under downstream adaptation.
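For concreteness, here is a minimal sketch of the two schedules in Python. The function names, the linear warmup, and the linear decay shape of the WSD baseline are illustrative assumptions, not details taken from the paper; the only point is that WSO simply never enters a decay phase.

```python
def wso_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Warmup-Stable-Only: linear warmup to peak_lr, then hold it to the end."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    return peak_lr  # stable phase, no decay

def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Warmup-Stable-Decay baseline: same as WSO plus a final decay phase."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < total_steps - decay_steps:
        return peak_lr  # stable phase
    return peak_lr * (total_steps - step) / decay_steps  # linear decay to zero

# Example: a 10k-step run with 1k warmup; WSD decays over the last 2k steps.
for s in (0, 999, 5000, 9000, 9999):
    print(s, wso_lr(s, 3e-4, 1000), wsd_lr(s, 10_000, 3e-4, 1000, 2000))
```

The two schedules agree everywhere except the tail, which isolates the decay phase as the variable driving the fine-tuning gap the paper reports.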
Provides a clean, actionable recommendation for labs that release base checkpoints intended to be fine-tuned by downstream users: skip the decay. Pairs with the optimizer and gradient-clipping line of work (AdEMAMix, AdaGC). By Kazuki Yano (Tohoku), Shun Kiyono, Sosuke Kobayashi, Sho Takase (SB Intuitions), and Jun Suzuki (Tohoku).
Paper: arXiv:2603.16127
Venue: ICLR 2026