Smooth Scaling Laws Hide Stepwise Token Learning

A scaling-science paper from Dots Studio (the team behind dots.llm1, led by Debing Zhang) shows that the smooth power-law shape of language-model loss curves emerges from a spectrum of discrete, sigmoid-shaped token-level learning events. Validated on 110+ pretraining runs (290M–6B MoE models, up to 300B tokens, 1,178 A100 GPU-days), the learning-time spectrum quantitatively predicts validation loss across training steps, data scale, and model scale. The practical payoff: reshaping the training distribution by token learnability reduces validation loss 11% faster.
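To make the central claim concrete, here is a toy sketch of how many sigmoid-shaped per-token learning events can average into a smooth power law. It is not the paper's model: the Pareto distribution of learning times, the parameters `alpha`, `tau_min`, and `width`, and the function names are all assumptions chosen for illustration. Each token's loss falls from about 1 to about 0 around its own learning time `tau`, and the aggregate loss is the mean over tokens.

```python
import numpy as np

def token_loss(t, tau, width=0.3):
    """Toy per-token loss: a sigmoid in log-time, ~1 before tau and ~0 after."""
    z = (np.log(t) - np.log(tau)) / width
    return 1.0 / (1.0 + np.exp(z))

def aggregate_loss(steps, taus, width=0.3):
    """Mean of the per-token curves at each training step."""
    # Broadcast to shape (len(steps), len(taus)), then average over tokens.
    return token_loss(steps[:, None], taus[None, :], width).mean(axis=1)

rng = np.random.default_rng(0)
alpha = 0.5      # assumed tail exponent of the learning-time spectrum
tau_min = 10.0   # assumed earliest learning time, in steps

# Classical Pareto learning times: P(tau > t) = (t / tau_min) ** -alpha.
taus = tau_min * (1.0 + rng.pareto(alpha, size=100_000))

steps = np.logspace(2, 4, 20)          # 100 to 10,000 steps
loss = aggregate_loss(steps, taus)

# Slope of the aggregate curve on log-log axes; a power law gives a constant slope.
slope = np.polyfit(np.log(steps), np.log(loss), 1)[0]
print(f"fitted log-log slope: {slope:.3f} (toy spectrum exponent: {-alpha})")
```

In this toy setting every individual curve is a step-like sigmoid, yet the averaged curve is close to a straight line on log-log axes with slope near `-alpha`, because the loss at step `t` is dominated by the fraction of tokens whose learning time has not yet arrived.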

Paper (arXiv)


Authors: Pingjie Wang · Zechen Hu · Peiru Yang · Fu Guo · Debing Zhang

scaling · research
