CCI 4.0
dataset
Large-scale bilingual (Chinese and English) pretraining dataset (~35 TB) built for high data quality and diverse reasoning trajectories. It comprises two parts: CCI4.0-M2-Base (a 5.2 TB curated Chinese web corpus, 22.5 TB of English from Nemotron-CC, and additional math, wiki, arXiv, and code sources) and CCI4.0-M2-CoT (4.5B synthesized chain-of-thought reasoning templates covering diverse reasoning patterns). Released alongside OpenSeek as part of stage one of the OpenSeek project. Successor to CCI 3.0 (Sep 2024) and earlier versions.