Empirical study across 3,700+ LLMs trained on nearly 100T tokens, establishing scaling laws for optimal hyperparameters. Finds that the optimal learning rate follows a power law in both model size and dataset size, while the optimal batch size depends mainly on dataset size.
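A minimal sketch of the functional form the summary describes: learning rate as a joint power law in model size N and dataset size D, and batch size as a power law in D alone. All coefficients and exponents below are hypothetical placeholders for illustration, not the paper's fitted values.

```python
def optimal_lr(n_params: float, n_tokens: float,
               c: float = 1.0, alpha: float = -0.25, beta: float = 0.1) -> float:
    """Optimal learning rate as a joint power law in model size N and
    dataset size D: lr = c * N^alpha * D^beta (placeholder exponents)."""
    return c * n_params**alpha * n_tokens**beta


def optimal_batch_size(n_tokens: float,
                       c: float = 1.0, gamma: float = 0.3) -> float:
    """Optimal batch size as a power law in dataset size D only:
    B = c * D^gamma (placeholder exponent)."""
    return c * n_tokens**gamma


# With a negative alpha, larger models get smaller learning rates,
# and batch size grows with the token budget.
lr = optimal_lr(1e9, 1e11)
bs = optimal_batch_size(1e11)
```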

Paper

arXiv: 2503.04715

scaling, training

Related