Empirical scaling study showing that larger subword vocabularies reliably improve LLM quality at fixed parameter budgets, a finding that ran counter to the then-prevailing preference for small vocabularies. It also introduces a practical method for swapping vocabularies during continual pretraining, so a model trained on one language mix can be adapted to a new target-language vocabulary without restarting from scratch. This recipe is particularly important for non-English LLM efforts and directly informed Sarashina's own 102K-token Japanese SentencePiece vocabulary.
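The vocabulary swap is easiest to picture at the embedding layer: build a new embedding matrix sized to the target tokenizer, reuse trained rows for tokens the two vocabularies share, and warm-start the remaining rows before continual pretraining resumes. Below is a minimal PyTorch sketch of that idea; the function name `swap_vocabulary` and the mean-of-old-embeddings fallback are illustrative assumptions, not necessarily the paper's exact initialization recipe.

```python
import torch

def swap_vocabulary(
    old_embeddings: torch.Tensor,   # (old_vocab_size, hidden_dim)
    old_vocab: dict[str, int],      # token -> id under the source tokenizer
    new_vocab: dict[str, int],      # token -> id under the target tokenizer
) -> torch.Tensor:
    """Return a (len(new_vocab), hidden_dim) embedding matrix for continual pretraining."""
    # Fallback row for tokens absent from the old vocabulary: the mean of all
    # old embeddings (a common warm-start heuristic; an assumption here, not
    # necessarily what the paper prescribes).
    mean_init = old_embeddings.mean(dim=0)
    new_embeddings = mean_init.repeat(len(new_vocab), 1)
    # Reuse trained rows for tokens shared by both vocabularies.
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_embeddings[new_id] = old_embeddings[old_id]
    return new_embeddings

# Toy usage: shared tokens keep their trained vectors, new ones get the mean row.
old_vocab = {"<unk>": 0, "hello": 1, "world": 2}
new_vocab = {"<unk>": 0, "world": 1, "こんにちは": 2}
old_emb = torch.randn(len(old_vocab), 8)
new_emb = swap_vocabulary(old_emb, old_vocab, new_vocab)
assert torch.equal(new_emb[1], old_emb[2])  # "world" row carried over
```

The same treatment would apply to the output (unembedding) matrix when it is not tied to the input embeddings; after the swap, continual pretraining on the target-language corpus lets the warm-started rows specialize.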

ACL Findings 2025. By Sho Takase, Ryokan Ri, Shun Kiyono, and Takuya Kato (all SB Intuitions).

Paper: arXiv:2406.16508

Venue: ACL Findings 2025

Tags: foundational, scaling, tokenization