"The AdEMAMix Optimizer: Better, Faster, Older." A modification of Adam using a mixture of two exponential moving averages to exploit older gradients that Adam discards. Demonstrates that gradients remain useful for tens of thousands of steps.

A 1.3B-parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (95% more data), demonstrating substantial training-efficiency gains from the optimizer alone. By Pagliardini, Ablin, and Grangier.

Paper

arXiv: 2409.03137

Venue: ICLR 2025

foundationalresearch