SRPO: Staged History-Resampling Policy Optimization

A two-stage reinforcement learning framework for LLMs that surpasses DeepSeek-R1-Zero-32B on AIME24 and LiveCodeBench while using only 1/10 of that model's training steps.
