SimpleRL-Zoo: Investigating and Taming Zero RL for Open Base Models
A systematic study of "zero RL training", which applies reinforcement learning with rule-based rewards directly to base models (the DeepSeek-R1 paradigm), across 10 diverse open base models rather than just the well-trodden Qwen2.5 series. The study spans Llama-3.1-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and Qwen2.5 models from 0.5B to 32B. It is motivated by the finding that Qwen2.5 base models already exhibit unusually strong instruction-following and self-reflection, so reproductions built only on them may not generalize.
The key design levers are adjusting the format reward and controlling query difficulty, which yield substantial gains in reasoning accuracy and response length across most settings. By monitoring training dynamics, the authors show that longer responses do not always coincide with emergent cognitive behaviors such as verification (the "aha moment"), and they report the first observation of an "aha moment" in small non-Qwen models. Code, models, and analysis tools are open-sourced. Published at COLM 2025.
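The two levers named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `\boxed{...}` answer convention, the `format_penalty` parameter, and the per-query `level` field are all hypothetical, chosen only to show what "adjusting the format reward" and "controlling query difficulty" might look like in a rule-based reward setup.

```python
import re
from typing import Optional

# Assumption: final answers are written as \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(response: str) -> Optional[str]:
    """Return the content of the last boxed answer in the response, if any."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, gold: str, format_penalty: float = 0.0) -> float:
    """Score a response with a simple rule-based check (hypothetical scheme).

    format_penalty is the lever for the format reward: 0.0 means a missing
    boxed answer is scored like a wrong answer, while a positive value
    additionally punishes the formatting failure.
    """
    answer = extract_answer(response)
    if answer is None:
        return 0.0 - format_penalty
    return 1.0 if answer == gold.strip() else 0.0


def filter_by_difficulty(queries: list, min_level: int, max_level: int) -> list:
    """Keep only queries whose (hypothetical) 'level' field lies in the given range."""
    return [q for q in queries if min_level <= q["level"] <= max_level]
```

For example, a model with weak instruction-following could be trained with `format_penalty=0.0` on `filter_by_difficulty(queries, 1, 3)`, while a stronger base model could use a harder slice. The specific thresholds here are placeholders, not values from the paper.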
Led by HKUST (Junxian He's NLP group; co-first authors Weihao Zeng and Yuzhen Huang; repo hkust-nlp/simpleRL-reason), in collaboration with ByteDance / TikTok (Qian Liu, Zejun Ma) and Meituan (Keqing He).