A systematic method that uses scaling laws to determine the optimal data mixture for any target domain, replacing trial-and-error approaches. The fitted law predicts model loss as a function of model size, number of training tokens, and domain weights.

Validated across LLM, native multimodal model, and large vision model pre-training. Parameters estimated from small-scale runs extrapolate to larger scales and unseen domain weights. NeurIPS 2025. By Shukor, Bethune, Busbridge, Grangier, Fini, El-Nouby, and Ablin.
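As a rough illustration of the idea, the sketch below evaluates a loss predictor that depends on model size `N`, training tokens `D`, and domain weights `h`, then searches the weight simplex for the mixture with the lowest predicted loss. The functional form, the parameter values in `PARAMS`, and the helper names are all assumptions made for this example; they are not taken from the paper, and in practice the parameters would be fitted to small-scale training runs.

```python
import numpy as np

# Hypothetical, made-up parameters for three domains (not fitted to any real runs).
PARAMS = {
    "E": 1.8, "A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28,
    "C": np.array([0.6, 0.3, 0.1]),
    "gamma": np.array([0.5, 0.7, 0.9]),
}

def predicted_loss(h, N, D, p=PARAMS):
    """Assumed illustrative law: size and token terms plus a mixture-dependent term."""
    h = np.asarray(h, dtype=float)
    mix_term = 1.0 / np.sum(p["C"] * h ** p["gamma"])
    return p["E"] + mix_term + p["A"] / N ** p["alpha"] + p["B"] / D ** p["beta"]

def simplex_grid(k, steps):
    """Yield integer tuples of length k whose entries sum to `steps`."""
    if k == 1:
        yield (steps,)
        return
    for i in range(steps + 1):
        for rest in simplex_grid(k - 1, steps - i):
            yield (i,) + rest

def best_mixture(N, D, p=PARAMS, steps=20):
    """Grid-search the weight simplex for the lowest predicted loss."""
    k = len(p["C"])
    best_h, best_loss = None, np.inf
    for counts in simplex_grid(k, steps):
        h = np.array(counts, dtype=float) / steps  # weights sum to 1
        loss = predicted_loss(h, N, D, p)
        if loss < best_loss:
            best_h, best_loss = h, loss
    return best_h, best_loss

if __name__ == "__main__":
    h_opt, loss_opt = best_mixture(N=1e9, D=2e10)
    print("weights:", h_opt, "predicted loss:", loss_opt)
```

A grid search is used here only because it is easy to verify with a handful of domains; with many domains a constrained optimizer over the simplex would be the more practical choice.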

Paper

arXiv: 2507.09404

Venue: NeurIPS 2025

research, foundational

Related