Refined scaling law for predicting LLM training loss across scales. Introduces a loss-surface function L(N, D) that reduces extrapolation error by 433% relative to Chinchilla's scaling law. Trains ~1,000 models (~3M GPU hours), with a full open-source release.
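As context for what a loss-surface function L(N, D) means here, a minimal sketch below fits the well-known Chinchilla-style parametric form L(N, D) = E + A/N^α + B/D^β to a few (N, D, loss) observations; this is the baseline form the refined law is compared against, not the paper's own functional form, and the data points, initial guesses, and bounds are illustrative assumptions.

```python
# Sketch only: Chinchilla-style baseline L(N, D) = E + A/N^alpha + B/D^beta,
# NOT the refined law from the paper (its functional form is not reproduced here).
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    """Parametric loss surface in model size N (params) and data size D (tokens)."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Hypothetical observations: (N params, D tokens, final training loss).
N = np.array([1e8, 1e8, 4e8, 4e8, 1e9, 1e9])
D = np.array([2e9, 8e9, 8e9, 2e10, 2e10, 6e10])
loss = np.array([3.25, 3.05, 2.80, 2.70, 2.52, 2.44])

# Fit the five parameters; p0 values are illustrative starting guesses.
popt, _ = curve_fit(
    chinchilla_loss, (N, D), loss,
    p0=[1.7, 400.0, 400.0, 0.34, 0.28],
    bounds=(0.0, np.inf), maxfev=20000,
)
E, A, B, alpha, beta = popt
print(f"L(N,D) ~ {E:.2f} + {A:.1f}/N^{alpha:.2f} + {B:.1f}/D^{beta:.2f}")

# Extrapolation error (the metric the 433% figure refers to) would then be
# measured by predicting the loss of models larger than any in the fit set.
```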

Paper

arXiv: 2506.10972

scaling, training

Related