AI Lab Tracker
RefinedWeb
dataset
2023-06-01
TII
Open web corpus of roughly 5 trillion tokens built from CommonCrawl, demonstrating that models trained on properly filtered web data alone can outperform models trained on curated corpora. A 600B-token public extract is released under ODC-By 1.0. Training data for all Falcon models.
Paper (arXiv)
HuggingFace
Paper
arXiv
HTML
Dataset
Website
data
open-source
Related
falcon