Framework for training LLMs to autonomously discover tool-use strategies via reinforcement learning rather than supervised fine-tuning. Models learn when and how to invoke computational tools (code execution, calculators) purely through reward-driven optimization, with no explicit instruction on tool-use patterns.

Demonstrates emergent behaviors: strategic tool invocation, self-regulation of ineffective code, and adaptive switching between computational and analytical reasoning — all arising organically from RL training. ToRL-7B achieves 43.3% on AIME 2024, outperforming RL without tools by 14% and the best existing Tool-Integrated Reasoning model by 17%. Open-sourced with implementation, datasets, and models. By Xuefeng Li, Haoyang Zou, Pengfei Liu (SJTU + SII GAIR).
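The reward-driven setup described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, reward values, and the execution penalty are all assumptions made for the example. The key idea is that only the final answer is rewarded, so the policy must discover on its own when invoking the code interpreter helps.

```python
# Hedged sketch of an outcome-based reward for tool-integrated RL rollouts.
# All names and reward values here are illustrative assumptions, not ToRL's code.

def outcome_reward(model_answer: str, gold_answer: str, code_executed_ok: bool) -> float:
    """Score a rollout by its final answer only; tool-use patterns are never
    rewarded directly, so they must emerge from optimization."""
    reward = 1.0 if model_answer.strip() == gold_answer.strip() else -1.0
    # Hypothetical shaping term: mildly penalize rollouts whose emitted code
    # failed to execute, discouraging ineffective tool calls.
    if not code_executed_ok:
        reward -= 0.5
    return reward
```

Because the reward never specifies *how* to use tools, behaviors like strategic invocation and self-correction of broken code can only arise as instrumental strategies for earning the answer reward.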

Paper

foundational reasoning reinforcement-learning coding