A systematic study of "zero RL training", in which rule-based-reward reinforcement learning is applied directly to base models (the DeepSeek-R1 paradigm), across 10 diverse open base models rather than just the well-trodden Qwen2.5 series. The study spans Llama-3.1-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and Qwen2.5 models from 0.5B to 32B. It is motivated by the finding that Qwen2.5 base models already exhibit unusually strong instruction-following and self-reflection, so reproductions built only on them may not generalize.

Key design levers: adjusting the format reward and controlling query difficulty, which yield substantial gains in reasoning accuracy and response length across most settings. By monitoring training dynamics, the authors show that longer responses do not always coincide with emergent cognitive behaviors such as verification (the "aha moment"), and they report the first observation of an "aha moment" in small non-Qwen models. Code, models, and analysis tools are open-sourced. COLM 2025.
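To make the two levers concrete, here is a minimal Python sketch of what a rule-based reward with an adjustable format term, and a simple query-difficulty filter, might look like. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `\boxed{...}` answer convention, the `format_penalty` parameter, and the pass-rate thresholds are all hypothetical and do not come from the paper or the hkust-nlp/simpleRL-reason repo.

```python
import re

# Assumption: final answers are written as \boxed{...}; this convention is illustrative only.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference: str, format_penalty: float = 0.0) -> float:
    """Hypothetical rule-based reward.

    Returns 1.0 for a correct boxed answer and 0.0 for a wrong one.
    A response with no boxed answer gets 0.0 minus format_penalty, so
    format_penalty=0.0 disables the format term entirely.
    """
    answer = extract_answer(response)
    if answer is None:
        return 0.0 - format_penalty
    return 1.0 if answer == reference.strip() else 0.0


def filter_by_difficulty(pass_rates: dict, low: float = 0.1, high: float = 0.9) -> list:
    """Hypothetical difficulty control: keep queries whose measured pass rate
    under the base model lies in [low, high], dropping ones that are
    trivially easy or currently out of reach."""
    return [query for query, rate in pass_rates.items() if low <= rate <= high]
```

For example, `rule_based_reward("so the result is \\boxed{42}", "42")` scores a correct answer, while raising `format_penalty` above zero additionally punishes responses that never produce a boxed answer. Exposing the format term as a parameter is one simple way to experiment with how strictly format is enforced.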

Led by HKUST (Junxian He's NLP group; co-first authors Weihao Zeng and Yuzhen Huang; repo hkust-nlp/simpleRL-reason), in collaboration with ByteDance / TikTok (Qian Liu, Zejun Ma) and Meituan (Keqing He).

Paper

Venue: COLM 2025
Authors: Weihao Zeng · Yuzhen Huang · Qian Liu · Wei Liu · Keqing He · Zejun Ma · Junxian He
reinforcement-learning · post-training · reasoning · research