Chinchilla (Compute-Optimal Training)
The paper "Training Compute-Optimal Large Language Models" found that existing LLMs were significantly undertrained: for compute-optimal training, model size and training tokens should be scaled in equal proportion. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) on MMLU (67.5% vs. 60%).
The Chinchilla scaling laws reshaped training decisions industry-wide, shifting labs from building the largest possible models to training smaller models on far more data, and directly influenced Llama, Mistral, and other efficient model families. Published at NeurIPS 2022 by Hoffmann et al. (DeepMind).
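The equal-scaling result is often summarized with two rules of thumb: training compute is roughly C ≈ 6·N·D (N parameters, D tokens), and the compute-optimal ratio is about D ≈ 20·N. A minimal sketch under those assumptions (the function name and the fixed tokens-per-parameter ratio are illustrative; the paper fits these constants empirically):

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Split a FLOP budget into a compute-optimal (params, tokens) pair,
    assuming C ~= 6*N*D and the rule-of-thumb ratio D ~= 20*N."""
    # From C = 6*N*(20*N) = 120*N^2, solve for N, then derive D.
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens of compute
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")  # ~70B, ~1.4T
```

Plugging in Gopher's larger parameter count with the same ratio shows why it was undertrained: at 280B parameters the rule calls for roughly 5.6T training tokens, far more than the 300B it saw.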
Paper
arXiv: 2203.15556
Venue: NeurIPS 2022