Derives empirical scaling laws for sparse upcycling, the practice of converting a pretrained dense LLM into a Mixture-of-Experts model by replicating its FFN layers as experts and continuing training. Identifies a novel interaction term between the dense-pretraining token budget and the continued-training token budget of the upcycled model; this term limits how much additional quality upcycling can recover at large compute budgets. In the high-compute regime, the benefit of upcycling therefore saturates relative to training a comparable MoE from scratch.
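
To make the interaction concrete, below is a minimal sketch of a joint scaling law of the kind described above. The functional form, the constants, and the helper name `upcycled_loss` are illustrative assumptions, not the paper's fitted parameterization; the point is only that a term coupling the two token budgets shrinks the loss recoverable by continued training as the dense budget grows.

```python
def upcycled_loss(d_dense, d_up, E=1.7, A=1.3, alpha=0.25,
                  B=1.1, beta=0.28, gamma=0.15):
    """Hypothetical joint scaling law for sparse upcycling (illustrative
    form and made-up constants, not the paper's fitted law).

    d_dense : tokens used to pretrain the dense model (billions)
    d_up    : tokens used for continued training after upcycling (billions)

    The last term couples the two budgets: its prefactor shrinks as d_dense
    grows, so extra upcycling tokens recover less loss once the dense model
    has already been trained on a large budget.
    """
    return (E
            + A * d_dense ** (-alpha)                    # dense-pretraining term
            + B * d_dense ** (-gamma) * d_up ** (-beta)) # interaction term

# Loss recovered by continued training (10B -> 300B upcycling tokens)
# at two different dense-pretraining budgets: the gain shrinks as the
# dense budget grows, i.e. upcycling saturates.
for d_dense in (100.0, 2000.0):  # billions of tokens
    gain = upcycled_loss(d_dense, 10.0) - upcycled_loss(d_dense, 300.0)
    print(f"dense budget {d_dense:6.0f}B tokens -> loss recovered by upcycling: {gain:.3f}")
```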

Directly motivates and informs Sarashina2-8x70B, which is itself a sparse-upcycled MoE derived from Sarashina2-70B. ICML 2025. By Seng Pei Liew, Takuya Kato, and Sho Takase (SB Intuitions). Code and fitted laws released.

Paper

arXiv: 2502.03009

Venue: ICML 2025

Tags: foundational, scaling, moe

Related