A rethink of web-corpus construction that treats HTML-to-text extraction, usually a fixed preprocessing step, as a learned task. The paper releases three artifacts: MinerU-HTML, a main-content extractor powered by a 0.6B small language model (SLM) that reformulates extraction as sequence labeling with state-machine-guided generation and emits clean Markdown that preserves code, formulas, and tables; MainWebBench, a benchmark of 7,887 annotated pages on which MinerU-HTML reaches 81.8% ROUGE-N F1 versus Trafilatura's 63.6%; and AICC, a 7.3-trillion-token AI-ready corpus extracted from Common Crawl using MinerU-HTML.
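To make the ROUGE-N F1 figures above easier to interpret, here is a minimal sketch of how such a score can be computed between extracted text and a reference annotation. It is an illustration only, not the benchmark's scoring code: the function names, the whitespace tokenization, and the default `n=2` are assumptions, since this summary does not state the n-gram order or tokenizer that MainWebBench uses.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=2):
    """ROUGE-N F1 between two strings.

    Assumptions (not taken from the paper): whitespace tokenization, n=2.
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)

    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand.values())  # share of extracted n-grams that are correct
    recall = overlap / sum(ref.values())      # share of reference n-grams that were recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    extracted = "the cat sat on the mat"
    annotated = "the cat sat"
    print(f"ROUGE-2 F1: {rouge_n_f1(extracted, annotated):.4f}")
```

Under this definition, an extractor that keeps boilerplate alongside the main content loses precision, and one that drops parts of the main content loses recall. The F1 score penalizes both kinds of error.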

In pretraining experiments with matched filtering, models trained on AICC (62B tokens) reach 50.8% average accuracy across 13 benchmarks, 1.08 percentage points above the Trafilatura-extracted TfCC baseline. This is evidence that better HTML parsing, not just filtering and deduplication, improves downstream models. Built by Shanghai AI Laboratory / OpenDataLab (Conghui He, Dahua Lin).

Paper

Dataset

Size 7.3T tokens
Format Markdown
training-data, training, benchmark, nlp
