Proposes RITE, a reinforcement learning framework for training LLM agents that interleave reasoning with tool use across diverse domains. Enforces continuous Plan-Action-Reflection cycles that ground reasoning in intermediate tool outputs and enable self-correction during complex tasks — addressing the limitation that standard paradigms treat tool usage as linear or isolated events.
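The Plan-Action-Reflection cycle can be sketched as a simple control loop. This is an illustrative toy, not the paper's implementation: the planner, tool, and reflection check below are hypothetical stand-ins, and the real agent would use an LLM for each step.

```python
def run_agent(task, tool, max_cycles=3):
    """Toy Plan-Action-Reflection loop: plan a step, ground it in a
    tool call, then reflect on the intermediate output before continuing."""
    history = []
    for _ in range(max_cycles):
        plan = f"evaluate: {task}"       # Plan: decide the next step
        result = tool(task)              # Action: invoke the tool
        history.append((plan, result))
        if result is not None:           # Reflect: accept or retry
            return result, history
    return None, history

# toy "tool": evaluates a simple arithmetic expression
answer, trace = run_agent("2 + 3", tool=lambda expr: eval(expr))
```

The key structural point is that reflection happens after every tool call, so each reasoning step is conditioned on a concrete intermediate result rather than on a single up-front plan.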

Introduces Dr. GRPO (token-level loss aggregation with importance sampling) to address reward sparsity and credit assignment in long tool-use trajectories, plus a dual-component reward system with dynamic curriculum via online rollout filtering. Demonstrates cross-domain generalization: agents trained on mathematical tasks transfer to coding, science, and other reasoning domains. By Chen, Yang, Xiao, Zhou, Zhang, Xi, Shi, Wang, Wang (Meituan + Zhejiang University + AI2 + CityU Hong Kong).
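A Dr. GRPO-style objective can be sketched in a few lines. This is a sketch under assumed details (the paper's exact loss may differ): group-relative advantages without standard-deviation normalization, clipped per-token importance ratios, and token-level summation without per-sequence length normalization, so long tool-use trajectories contribute proportionally more gradient signal.

```python
import numpy as np

def dr_grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Sketch of a Dr. GRPO-style group objective (assumed details).

    logp_new, logp_old: lists of 1-D arrays of per-token log-probs,
        one array per trajectory (variable lengths allowed)
    rewards: one scalar reward per trajectory in the rollout group
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()              # group-relative, no std division
    total = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # PPO-style pessimistic clip, summed over tokens (no length norm)
        total += np.minimum(ratio * a, clipped * a).sum()
    return -total / len(rewards)                # average over the group only
```

Because every token of a trajectory shares that trajectory's advantage, this gives dense per-token gradients from a single sparse trajectory-level reward, which is the credit-assignment mechanism the summary refers to.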

Paper

foundational · reasoning · agentic · reinforcement-learning