Data Darwinism / Darwin Corpora
dataset
Systematic data curation framework with an L0-L9 quality taxonomy. Darwin-Science: 900B-token scientific corpus (+5.60/+8.40 on domain tasks). Darwin-CC: 504B tokens from 672B across 8 categories, with 30 iterations per category. Surpasses DCLM, Ultra-FineWeb, and FineWeb-Edu.
Darwin-CC: 1.02B HuggingFace downloads, 3K+ likes.
Paper
arXiv: 2602.07824