Paloma
Dataset: Perplexity-based benchmark measuring language model fit across 585 textual domains drawn from 18 sources, rather than assuming that perplexity on a single distribution generalizes. Includes 6 controlled 1B-parameter baselines and standardizes evaluation via bits-per-byte scoring and variance-aware domain sampling.
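Bits-per-byte normalizes a model's total negative log-likelihood by the UTF-8 byte length of the text rather than the token count, so scores are comparable across tokenizers. A minimal sketch of that conversion (the function name and the example NLL value are illustrative, not from Paloma's codebase):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a total negative log-likelihood (in nats) on `text` into
    bits-per-byte: divide by ln(2) to get bits, then by the UTF-8 byte
    length so the score is tokenizer-independent."""
    num_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * num_bytes)

# Hypothetical total NLL of 120 nats over a 40-byte string.
sample = "Perplexity varies widely across domains."
print(round(bits_per_byte(120.0, sample), 3))
```

Summing NLL in nats over a whole domain and dividing once at the end, as above, avoids averaging artifacts from documents of different lengths.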
Paper
arXiv: 2312.10523