Scaling Laws for Optimal Data Mixtures

A systematic method that uses scaling laws to determine the optimal data mixture for any target domain, replacing trial-and-error approaches. The scaling laws predict model loss as a function of model size, number of training tokens, and domain weights.

Validated across LLM, native multimodal model, and large vision model pre-training. Parameters estimated from small-scale runs extrapolate to larger scales and unseen domain weights. NeurIPS 2025. By Shukor, Bethune, Busbridge, Grangier, Fini, El-Nouby, and Ablin.
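To make the idea concrete, here is a minimal sketch of how a fitted law of this kind could be used to pick domain weights. The functional form, the parameter values, and the names `predicted_loss`, `simplex_grid`, and `best_mixture` are all illustrative assumptions, not the formula or code from the paper. The sketch only shows the general workflow: once a loss predictor over model size, token count, and domain weights is available, the mixture can be chosen by minimizing the predicted loss over the probability simplex.

```python
# Hypothetical scaling law with a mixture-dependent term (an assumption for
# illustration, not the paper's fitted form):
#   L(N, D, h) = E + A / N**alpha + B / D**beta + 1 / sum_i(C_i * h_i**gamma_i)
# N = model parameters, D = training tokens, h = domain weights (sum to 1).

def predicted_loss(n_params, n_tokens, weights, p):
    """Evaluate the hypothetical law for one (N, D, h) configuration."""
    mix = sum(c * h ** g for c, h, g in zip(p["C"], weights, p["gamma"]))
    return (p["E"]
            + p["A"] / n_params ** p["alpha"]
            + p["B"] / n_tokens ** p["beta"]
            + 1.0 / mix)


def simplex_grid(k, steps):
    """Yield every length-k weight vector whose entries are multiples of 1/steps and sum to 1."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for i in range(remaining + 1):
            for rest in rec(remaining - i, slots - 1):
                yield (i,) + rest

    for counts in rec(steps, k):
        yield tuple(c / steps for c in counts)


def best_mixture(n_params, n_tokens, p, steps=20):
    """Grid-search the simplex for the weights with the lowest predicted loss."""
    return min(simplex_grid(len(p["C"]), steps),
               key=lambda h: predicted_loss(n_params, n_tokens, h, p))


# Made-up parameters for three domains; in practice these would be estimated
# by fitting the law to losses measured in small-scale training runs.
params = {
    "E": 1.8, "A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28,
    "C": [0.5, 1.0, 0.25], "gamma": [0.5, 0.5, 0.5],
}

best = best_mixture(1e9, 2e10, params)
print("weights:", best)
print("predicted loss:", predicted_loss(1e9, 2e10, best, params))
```

In this additive sketch the mixture term is independent of model size and token count, so the selected weights do not change with scale. A form in which the mixture interacts with those terms would make the optimal weights scale-dependent. A grid search is used here only for simplicity; a constrained optimizer would be the natural choice with more domains.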
