A two-stage reinforcement learning framework for LLMs that surpasses DeepSeek-R1-Zero-32B on AIME24 and LiveCodeBench while using only 1/10 of the training steps.


reasoning, training, efficiency