"On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length." Studies how task horizon length affects RL training of LLM agents using controlled tasks where the same decision rules and reasoning structure are evaluated at varying action-sequence lengths.

Headline finding: longer horizons cause severe training instability and catastrophic collapse, while reducing the horizon via macro actions restores stability and dramatically improves success rates. The paper also introduces horizon generalization: models trained at shorter horizons transfer to longer-horizon variants at inference time, whereas artificially re-extending horizons during training causes eventual collapse.

Primary model: Qwen3-1.7B; findings validated at the 4B scale and on frontier closed models including GPT-5-mini and Gemini-3-Flash. Tasks: Sudoku (9×9, with the number of empty cells as the horizon knob), Rush Hour, and WebShop. Venue: ICML 2026. Authors: Microsoft Research Asia (Liang Wang, Nan Yang, Xingxing Zhang, Furu Wei) with Yonsei University (Sunghwan Kim, Junhee Cho, Beong-woo Kwak, Taeyoon Kwon, Jinyoung Yeo).
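A minimal sketch of the Sudoku horizon knob and of how macro actions shorten the horizon. This is an illustration, not the paper's code. It assumes the horizon equals the number of empty cells when each action fills one cell, and that a macro action fills up to `macro_size` cells at once. The names `make_puzzle`, `horizon`, and `macro_size` are hypothetical.

```python
import math

# A valid solved 9x9 grid built from a standard shifting pattern.
SOLVED = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]


def make_puzzle(solution, n_empty):
    """Blank the first n_empty cells (row-major order); 0 marks an empty cell.

    Assumption: which cells are blanked does not matter for counting the
    horizon, so the first n_empty cells are used for simplicity.
    """
    flat = [v for row in solution for v in row]
    for i in range(n_empty):
        flat[i] = 0
    return [flat[r * 9:(r + 1) * 9] for r in range(9)]


def horizon(puzzle, macro_size=1):
    """Number of decision steps needed to fill every empty cell.

    Assumption: one primitive action fills one cell, and one macro action
    fills up to macro_size cells, so the horizon is ceil(empty / macro_size).
    """
    empty = sum(v == 0 for row in puzzle for v in row)
    return math.ceil(empty / macro_size)


puzzle = make_puzzle(SOLVED, 40)
print(horizon(puzzle))                # one cell per action
print(horizon(puzzle, macro_size=4))  # four cells per macro action
```

Under these assumptions, the same puzzle and the same solving rules yield a shorter action sequence as `macro_size` grows, which is the sense in which macro actions reduce the horizon without changing the underlying task.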

Venue ICML 2026
reasoning, agents