llm-jp-corpus
Dataset. The largest Japanese web training corpus: 312.1B characters / 173M pages drawn from Common Crawl, Wikipedia, and archived technical papers (Kaken). Includes specialized filtering for Japanese text quality. Evolved through v1-v4 across LLM-jp model generations.
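The exact filtering pipeline is described in the paper; as an illustration only, a common first-pass heuristic for Japanese web text is to keep pages whose character mix is predominantly Japanese. The threshold and character ranges below are assumptions for the sketch, not the corpus's actual rules:

```python
import re

# Hiragana, katakana, and CJK unified ideograph ranges
# (illustrative; the real pipeline uses more signals than this).
JA_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def japanese_ratio(text: str) -> float:
    """Fraction of characters that are hiragana, katakana, or kanji."""
    if not text:
        return 0.0
    return len(JA_CHARS.findall(text)) / len(text)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    """Keep a page only if it is predominantly Japanese (hypothetical threshold)."""
    return japanese_ratio(text) >= threshold
```

A mostly Japanese page passes (`keep_page("これは日本語のテキストです。")` is true), while an all-English page is dropped.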
Paper: arXiv:2404.17733