A 1.0T-token, high-quality English webtext dataset derived from Common Crawl, also known as WanJuan 2.0. It is designed for pre-training large language models, with a focus on safety and high information density.

Paper

arXiv: 2402.19282

Tags: training-data, training, nlp
