AICC: A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
A rethink of web-corpus construction that treats HTML-to-text extraction, usually a fixed preprocessing step, as a learned task. The paper releases three artifacts: MinerU-HTML, an SLM-powered main-content extractor (a 0.6B model that reformulates extraction as sequence labeling with state-machine-guided generation, emitting clean Markdown that preserves code, formulas, and tables); MainWebBench, a benchmark of 7,887 annotated pages on which MinerU-HTML reaches 81.8% ROUGE-N F1 vs Trafilatura's 63.6%; and AICC, a 7.3-trillion-token AI-ready corpus extracted from Common Crawl with it.
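For readers unfamiliar with the metric behind the MainWebBench numbers, ROUGE-N F1 can be sketched as the harmonic mean of n-gram precision and recall between an extracted text and its reference annotation. The version below is only an illustration, not the benchmark's scoring code: the whitespace tokenization, the default `n=2`, and the function names `ngram_counts` and `rouge_n_f1` are all assumptions made here, and the paper's exact tokenizer and n-gram order may differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=2):
    """Illustrative ROUGE-N F1 between two strings.

    Assumptions (not taken from the paper): whitespace tokenization
    and a caller-chosen n-gram order.
    """
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    if not ref or not cand:
        return 0.0
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both texts (Counter intersection takes the minimum).
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under this definition an extractor is penalized both for dropping main content (lower recall) and for leaking boilerplate such as navigation or footers into its output (lower precision).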
In matched-filtering pretraining experiments, models trained on AICC (62B tokens) reach 50.8% average accuracy across 13 benchmarks, +1.08pp over the Trafilatura-extracted TfCC baseline — evidence that better HTML parsing, not just filtering and dedup, improves downstream models. Built by Shanghai AI Laboratory / OpenDataLab (Conghui He, Dahua Lin).