"The Science of Pretraining." A 3B model trained on 8T tokens with 200+ controlled ablations. Scores 51.72 overall, matching OLMo-3 7B with less than half the parameters. MATH: 62.80 vs OLMo-3's 39.60.

Introduces the Data Darwinism framework for systematic data processing (L0-L9 taxonomy) and a two-stage adaptive curriculum. Full training trajectory released (logs, checkpoints, data mixtures).
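The paper does not spell out the curriculum here, but a two-stage adaptive curriculum over a quality taxonomy can be sketched as a step-dependent data-mixture schedule. Everything below is illustrative: the function name, the uniform-then-upweighted weighting, and the switch point are assumptions, not the paper's actual recipe; only the idea of ten quality tiers (L0-L9) and two stages comes from the summary above.

```python
# Hypothetical sketch: per-tier sampling weights for a two-stage curriculum.
# Tiers 0..9 stand in for the L0-L9 taxonomy, assumed here to be ascending
# quality labels. Stage boundaries and weights are illustrative only.

def mixture_weights(step, total_steps, n_tiers=10, switch_frac=0.8):
    """Return normalized per-tier sampling weights at a given training step."""
    if step < switch_frac * total_steps:
        # Stage 1: sample uniformly across all quality tiers.
        w = [1.0] * n_tiers
    else:
        # Stage 2: linearly up-weight the higher-quality tiers.
        w = [float(i + 1) for i in range(n_tiers)]
    total = sum(w)
    return [x / total for x in w]

# Stage 1 (step 1000 of 10000): uniform weights.
print(mixture_weights(1000, 10000)[:3])  # → [0.1, 0.1, 0.1]
# Stage 2 (step 9000 of 10000): top tier outweighs the bottom tier.
w2 = mixture_weights(9000, 10000)
print(w2[9] > w2[0])  # → True
```

A real implementation would plug these weights into the sampler of a streaming data loader; the point of the sketch is only that "adaptive curriculum" reduces to a schedule over mixture weights.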

Model Details

Architecture: Dense
Parameters: 3B

Paper

arXiv: 2603.27164

Tags: open-source, open-weight, data, research
