RedStone
dataset
Scalable pipeline for curating LLM pretraining data from Common Crawl. It produces ~3.17 trillion tokens across four data types: general web text, code, math, and QA. The pipeline has three modules: Collection (extraction and deduplication), Filtering (quality classifiers and heuristic rules), and Extraction (domain-specific parsers that pull code, math, and QA content from web pages).
Open-sourced with full reproduction scripts. All datasets were verified to be comparable in scale and quality to Microsoft's internal pretraining data. By Chang, Cui, Dong, Huang, Wei et al. (Microsoft Research).
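
The three-module structure described above can be illustrated with a toy sketch. This is not the RedStone code: the function names (`collect`, `keep`, `route`, `run_pipeline`), the hash-based exact deduplication, the word-count rule, and the keyword patterns are all hypothetical stand-ins chosen only to show how collection, filtering, and domain routing might chain together.

```python
import hashlib
import re
from typing import Dict, Iterable, Iterator, List


def collect(pages: Iterable[str]) -> Iterator[str]:
    """Collection stand-in: normalize whitespace and drop exact duplicates."""
    seen = set()
    for page in pages:
        text = " ".join(page.split())
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def keep(text: str, min_words: int = 5) -> bool:
    """Filtering stand-in: one heuristic rule in place of real classifiers."""
    return len(text.split()) >= min_words


# Hypothetical keyword hints; a real extractor would use domain-specific parsers.
CODE_HINT = re.compile(r"\b(def|class|import|return)\b")
MATH_HINT = re.compile(r"\$[^$]+\$|\\frac|\\sum")


def route(text: str) -> str:
    """Extraction stand-in: assign a document to one of the four data types."""
    if CODE_HINT.search(text):
        return "code"
    if MATH_HINT.search(text):
        return "math"
    if "?" in text:
        return "qa"
    return "general"


def run_pipeline(pages: Iterable[str]) -> Dict[str, List[str]]:
    """Chain the three stand-in modules and bucket the surviving documents."""
    buckets: Dict[str, List[str]] = {"general": [], "code": [], "math": [], "qa": []}
    for text in collect(pages):
        if keep(text):
            buckets[route(text)].append(text)
    return buckets


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
        "too short",
        "def add(a, b): return a + b  # simple helper function",
        "Solve $x^2 = 4$ for x over the real numbers.",
        "What is the capital of France? It is Paris.",
    ]
    for name, docs in run_pipeline(sample).items():
        print(name, len(docs))
```

Keeping each stage as a separate function mirrors the modular layout in the description, so any stand-in (for example the word-count rule) could be swapped for a trained quality classifier without touching the other stages.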