AI Lab Tracker
WanJuan 1.0 Corpus
dataset
2023-08-21
PJLab
High-quality multimodal pre-training corpus of more than 2 TB, spanning English and Chinese text, image-text pairs, and video.
Paper (arXiv)
GitHub
training-data
training
multimodal
Related
wanjuan-cc
wanjuan-3.0
internlm-1.0
Notes
Submitted to arXiv on Aug 21, 2023.