An automated framework for discovering optimal pretraining data mixtures through semantic clustering and iterative optimization. It clusters the data into 20 semantic groups, uses a small proxy model to evaluate candidate mixtures, and iteratively refines the mixture weights, replacing manual data curation with a principled optimization loop.
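The loop below is a minimal, illustrative sketch of this idea, not the released implementation: it assumes 20 cluster weights, a hypothetical `evaluate_mixture` stand-in for training and scoring a proxy model, and a simple Dirichlet-based candidate sampler that narrows the search around the best mixture each round.

```python
import numpy as np

NUM_CLUSTERS = 20          # semantic clusters, as in the paper
CANDIDATES_PER_ROUND = 16  # assumed search width (illustrative)
NUM_ROUNDS = 5             # assumed number of refinement rounds

rng = np.random.default_rng(0)

# Synthetic "ideal" mixture used only to make this sketch runnable;
# in the real framework the signal comes from proxy-model benchmarks.
_synthetic_optimum = rng.dirichlet(np.ones(NUM_CLUSTERS))

def evaluate_mixture(weights: np.ndarray) -> float:
    """Stand-in for the expensive step: train a small proxy model on data
    sampled according to `weights` and return a downstream score.
    Simulated here as closeness to a synthetic optimum."""
    return -float(np.sum((weights - _synthetic_optimum) ** 2))

def sample_candidates(center: np.ndarray, concentration: float) -> np.ndarray:
    """Draw candidate mixtures from a Dirichlet centered on the current best."""
    return rng.dirichlet(center * concentration, size=CANDIDATES_PER_ROUND)

best_weights = np.full(NUM_CLUSTERS, 1.0 / NUM_CLUSTERS)  # start from uniform
best_score = evaluate_mixture(best_weights)

for round_idx in range(NUM_ROUNDS):
    # Tighten the search around the current best mixture each round.
    concentration = 20.0 * (round_idx + 1)
    for candidate in sample_candidates(best_weights, concentration):
        score = evaluate_mixture(candidate)
        if score > best_score:
            best_score, best_weights = score, candidate
    print(f"round {round_idx}: best proxy score {best_score:.4f}")

print("final mixture weights:", np.round(best_weights, 3))
```

In the actual method the expensive proxy-training step dominates the cost, which is why candidate mixtures are evaluated on a small model before any weights are committed to a full pretraining run.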

A 1B model trained on the optimized mixture (400B tokens) surpasses Llama-3.2-1B by 2.0%, and domain-specific optimization yields 5% gains over random sampling. The project releases two datasets: Nemotron-ClimbLab (1.2T tokens, 20 semantic clusters, for research) and Nemotron-ClimbMix (400B tokens, optimized mixture, for efficient pretraining). Published at NeurIPS 2025.

Outputs (3)

Nemotron-CLIMB Paper

paper
Venue: NeurIPS 2025

Nemotron-ClimbLab (1.2T tokens)

dataset

A 1.2-trillion-token filtered corpus organized into 20 semantic clusters for data mixture research.

Size: 1.2T tokens
Format: text
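As a hedged sketch of how a cluster-organized corpus like this could be used for mixture experiments (the loading layout, helper name, and example weights below are hypothetical, not part of the dataset card): draw a pretraining sample in which each of the 20 clusters contributes in proportion to a chosen mixture weight.

```python
# Minimal sketch, assuming the 20 clusters are available as in-memory lists of
# documents; all names and the example weights here are hypothetical.
import random

def sample_mixture(cluster_docs: dict[int, list[str]],
                   weights: list[float],
                   num_docs: int,
                   seed: int = 0) -> list[str]:
    """Sample `num_docs` documents so that cluster i is chosen with
    probability proportional to weights[i]."""
    rng = random.Random(seed)
    cluster_ids = sorted(cluster_docs)
    picks = rng.choices(cluster_ids, weights=weights, k=num_docs)
    return [rng.choice(cluster_docs[c]) for c in picks]

# Toy usage: two documents per cluster, uniform weights over 20 clusters.
toy_clusters = {i: [f"cluster {i} doc {j}" for j in range(2)] for i in range(20)}
sample = sample_mixture(toy_clusters, weights=[1.0] * 20, num_docs=5)
print(sample)
```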

Nemotron-ClimbMix (400B tokens)

dataset

A 400-billion-token curated mixture optimized for efficient pretraining; it yields stronger models than the Llama-3.2-1B training data at equivalent token budgets.

Size: 400B tokens
Format: text
data · infrastructure · scaling · foundational