An empirical study of more than 3,700 LLMs trained on 100T tokens that establishes scaling laws for optimal hyperparameters. It finds that the optimal learning rate follows a power law in both model size and dataset size, while the optimal batch size depends mainly on dataset size.
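
As a rough illustration of the functional form described above, the sketch below writes the learning rate as a power law in model size N and dataset size D, and the batch size as a power law in D alone. The function names, parameter names, and all coefficient values are hypothetical placeholders chosen for readability; they are not the fitted values from the paper.

```python
def optimal_lr(n_params: float, n_tokens: float,
               c: float, alpha: float, beta: float) -> float:
    """Power-law form: lr = c * N**alpha * D**beta.

    c, alpha, beta are fitted constants; the caller must supply them.
    """
    return c * n_params ** alpha * n_tokens ** beta


def optimal_batch_size(n_tokens: float, k: float, gamma: float) -> float:
    """Power-law form in dataset size only: B = k * D**gamma."""
    return k * n_tokens ** gamma


if __name__ == "__main__":
    # Placeholder constants for illustration only (NOT from the paper).
    lr = optimal_lr(n_params=1e9, n_tokens=1e11, c=1.0, alpha=-0.5, beta=0.25)
    bs = optimal_batch_size(n_tokens=1e11, k=1.0, gamma=0.5)
    print(f"lr={lr:.3e}, batch_size={bs:.3e}")
```

In practice the constants would be obtained by fitting to a hyperparameter sweep, for example by linear regression in log space, since a power law is linear in the logarithms of N and D.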

Paper

scaling, training

Related