[2603.24709] Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Computer Science > Machine Learning
arXiv:2603.24709 (cs) [Submitted on 25 Mar 2026]

Title: Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Authors: Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song

Abstract: Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atom...
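The cache-backed synthesis idea can be illustrated with a small sketch: by only sampling calls whose responses already exist in a cache of recorded API interactions, every generated trace is executable by construction, and dependent steps are found by matching cached outputs to downstream inputs. The cache schema, API names, and `followups` map below are illustrative assumptions, not details from the paper.

```python
# Sketch of constrained trace synthesis over a cache of real API responses.
# Each step's arguments come from the propagated outputs of earlier steps,
# and a step is only sampled if its response is present in the cache.
import random

# cache: (api_name, frozen_args) -> recorded response (illustrative data)
CACHE = {
    ("search_products", (("query", "usb hub"),)): {"product_id": 42},
    ("get_product", (("product_id", 42),)): {"price": 19.99},
    ("check_stock", (("product_id", 42),)): {"in_stock": True},
}

def cached_call(api, args):
    """Replay an API call from the cache; None if it was never recorded."""
    return CACHE.get((api, tuple(sorted(args.items()))))

def sample_trace(start_api, start_args, followups, max_steps=3, rng=random):
    """Sample a chain of dependent calls, propagating cached outputs as inputs.

    `followups` maps an output field name to APIs that consume it, so every
    sampled step is guaranteed to have a cache-backed response.
    """
    trace, state = [], dict(start_args)
    api, args = start_api, start_args
    for _ in range(max_steps):
        resp = cached_call(api, args)
        if resp is None:
            break
        trace.append((api, args, resp))
        state.update(resp)  # propagate intermediate outputs
        candidates = [
            (nxt, {field: state[field]})
            for field, apis in followups.items() if field in state
            for nxt in apis
            if cached_call(nxt, {field: state[field]}) is not None
            and nxt not in {t[0] for t in trace}
        ]
        if not candidates:
            break
        api, args = rng.choice(candidates)
    return trace
```

Trace complexity is controllable here through `max_steps` and the branching of `followups`, matching the abstract's claim of controllable complexity; efficiency comes from never generating a call that would have to hit a live API.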
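The graduated reward can likewise be sketched as a weighted decomposition of trace correctness into per-step components (tool selection, argument values, ordering), so a mostly-correct trace earns partial credit where a binary reward would give zero. The component set and weights below are illustrative assumptions; the paper's exact decomposition is cut off in this abstract.

```python
# Sketch of a graduated reward over tool-call traces. A trace is a list of
# (tool_name, args_dict) pairs; the score is in [0, 1]. Weights are illustrative.

def graduated_reward(predicted, reference, w_tool=0.4, w_args=0.4, w_order=0.2):
    """Score a predicted trace against a reference trace with partial credit."""
    if not reference:
        return 1.0 if not predicted else 0.0
    n = len(reference)
    # Tool-selection credit: reference steps whose tool matches by position.
    tool_hits = sum(
        1 for i, (name, _) in enumerate(reference)
        if i < len(predicted) and predicted[i][0] == name
    )
    # Argument credit: position-matched steps with exactly matching arguments
    # (parameter value errors lose this component but keep the others).
    arg_hits = sum(
        1 for i, step in enumerate(reference)
        if i < len(predicted) and predicted[i] == step
    )
    # Ordering credit: normalized longest common subsequence of tool names.
    a = [name for name, _ in predicted]
    b = [name for name, _ in reference]
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if a[i] == b[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    order = lcs[len(a)][len(b)] / n
    return w_tool * tool_hits / n + w_args * arg_hits / n + w_order * order
```

For example, a trace that calls the right tools in the right order but passes one wrong parameter value scores 0.8 under these weights, whereas a binary exact-match reward would score it 0, giving the policy no gradient toward the nearly-correct behavior.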