A 1.5 trillion-token "distilled" subset of Common Crawl data, optimized for pre-training.