SRPO: Staged History-Resampling Policy Optimization
A two-stage reinforcement learning framework for LLMs that surpasses DeepSeek-R1-Zero-32B on AIME24 and LiveCodeBench with only 1/10 of the training steps.
Paper
arXiv: 2504.14286