Efficient RL Training for LLMs with Experience Replay
Challenges the prevailing assumption that fresh, on-policy data is necessary for effective LLM reinforcement learning. Systematically studies replay buffers for LLM post-training (RLHF, GRPO, RLVR), formalizing the design space as a tradeoff between staleness-induced variance, sample diversity, and generation cost.
Demonstrates that well-designed replay buffers can drastically reduce inference compute without degrading final model performance or policy entropy: strict on-policy sampling is suboptimal when rollout generation is the computational bottleneck. A practical drop-in improvement for any lab's RL post-training pipeline. By Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, and Remi Munos (FAIR at Meta + NYU Courant).
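To make the staleness/diversity/cost tradeoff concrete, here is a minimal sketch of the idea, not the authors' implementation: a bounded rollout buffer that tags each rollout with the policy version that generated it and refuses to replay anything older than a staleness cap. The class name, the `max_staleness` parameter, and the toy loop are all illustrative assumptions, not details from the paper.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float
    policy_version: int  # index of the policy update that generated this rollout


class RolloutReplayBuffer:
    """Bounded FIFO buffer of rollouts; sampling enforces a staleness cap.

    Hypothetical sketch: capacity bounds memory / sample diversity, while
    max_staleness bounds how off-policy a replayed rollout may be.
    """

    def __init__(self, capacity: int = 4096, max_staleness: int = 4):
        self.rollouts: deque = deque(maxlen=capacity)  # old rollouts evicted FIFO
        self.max_staleness = max_staleness  # tolerated age, in policy updates

    def add(self, batch: list) -> None:
        self.rollouts.extend(batch)

    def sample(self, batch_size: int, current_version: int) -> list:
        # Discard rollouts whose generating policy is too old; replaying the
        # rest trades generation compute against staleness-induced variance.
        usable = [r for r in self.rollouts
                  if current_version - r.policy_version <= self.max_staleness]
        return random.sample(usable, min(batch_size, len(usable)))


if __name__ == "__main__":
    buf = RolloutReplayBuffer(capacity=100, max_staleness=2)
    for version in range(5):
        # Pretend each step generates a few fresh rollouts (the expensive part);
        # the rest of the training batch is padded with replayed ones.
        fresh = [Rollout(f"q{version}-{i}", f"a{i}", random.random(), version)
                 for i in range(8)]
        buf.add(fresh)
        batch = buf.sample(batch_size=16, current_version=version)
        if batch:
            avg_age = sum(version - r.policy_version for r in batch) / len(batch)
            print(f"step {version}: batch of {len(batch)}, mean staleness {avg_age:.2f}")
```

In a real pipeline, replayed (stale) rollouts would typically also be importance-weighted in the policy-gradient loss to correct for the off-policy gap; this sketch only shows the buffering side of the design space.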