Optimal Batch Size Scheduling via Functional Scaling Laws
Paper: Introduces a principled framework for Batch Size Scheduling (BSS) based on functional scaling laws. The paper uncovers the "fast catch-up" effect and shows that, for hard tasks, the optimal schedule keeps the batch size small for most of training and switches to large batches only in the late stage, substantially reducing data consumption without sacrificing performance.
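
To make the shape of such a schedule concrete, here is a minimal sketch of a two-stage (small, then large) batch size schedule. It is not the paper's method: the function names, the batch sizes 32 and 512, and the switch point at 75% of training are all hypothetical values chosen for illustration, and the optimal values would come from the paper's scaling-law analysis.

```python
def two_stage_batch_size(step, total_steps, small=32, large=512, switch_frac=0.75):
    """Batch size at `step` under a hypothetical small-then-large schedule.

    All default values are illustrative assumptions, not values from the paper.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Steps before the switch point use the small batch; the rest use the large one.
    switch_step = int(switch_frac * total_steps)
    return small if step < switch_step else large


def samples_consumed(total_steps, **schedule_kwargs):
    """Total number of training samples drawn over the whole schedule."""
    return sum(
        two_stage_batch_size(step, total_steps, **schedule_kwargs)
        for step in range(total_steps)
    )


if __name__ == "__main__":
    total_steps = 100
    scheduled = samples_consumed(total_steps)
    constant_large = 512 * total_steps
    print(f"two-stage schedule: {scheduled} samples")
    print(f"constant large batch: {constant_large} samples")
```

With these assumed settings, the schedule draws far fewer samples than a constant large batch over the same number of steps. Whether the final loss matches is the empirical question the paper addresses; this sketch only does the data accounting.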