"The AdEMAMix Optimizer: Better, Faster, Older." A modification of Adam using a mixture of two exponential moving averages to exploit older gradients that Adam discards. Demonstrates that gradients remain useful for tens of thousands of steps.

A 1.3B-parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (95% more data), demonstrating substantial training-efficiency gains from the optimizer alone. By Pagliardini, Ablin, and Grangier.

Paper

arXiv: 2409.03137

Venue: ICLR 2025

foundationalresearch