[2602.12268] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Summary
The paper presents CM2, a novel reinforcement learning framework that utilizes checklist rewards to enhance multi-turn and multi-step tool use in AI agents, improving performance over traditional methods.
Why It Matters
CM2 addresses significant challenges in reinforcement learning for AI agents, particularly in environments requiring complex interactions. By introducing checklist rewards, it offers a more stable and scalable approach to training, which could lead to advancements in AI applications across various domains.
Key Takeaways
- CM2 replaces verifiable outcome rewards with checklist rewards for more stable training.
- The framework allows for fine-grained evaluation of agent performance through binary criteria.
- Training is conducted in a scalable simulated environment, reducing engineering costs.
- CM2 demonstrates significant performance improvements over supervised fine-tuning.
- The approach can optimize multi-turn, multi-step tool-using agents effectively.
Computer Science > Artificial Intelligence
arXiv:2602.12268 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Authors: Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang
Abstract: AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy ...
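To make the checklist-reward idea concrete, here is a minimal sketch of how a per-turn reward could be computed from binary criteria. This is an illustrative assumption, not the paper's implementation: the `Criterion` structure, the keyword-based checks standing in for an LLM judge, and the mean-of-criteria aggregation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    """One fine-grained binary check on the agent's behavior in a turn."""
    description: str                   # what the agent should have done
    check: Callable[[str], bool]       # binary judge over the turn transcript

def checklist_reward(turn_transcript: str, checklist: List[Criterion]) -> float:
    """Reward = fraction of binary criteria the turn satisfies (0.0 if empty)."""
    if not checklist:
        return 0.0
    passed = sum(1 for c in checklist if c.check(turn_transcript))
    return passed / len(checklist)

# Toy usage: simple string checks stand in for classification-style judgments.
checklist = [
    Criterion("invoked the search tool", lambda t: "search(" in t),
    Criterion("grounded the answer in evidence", lambda t: "evidence:" in t),
]
reward = checklist_reward('search("flights") ... evidence: result #2', checklist)
# reward == 1.0: both criteria satisfied
```

Turning the open-ended judgment "was this a good turn?" into several yes/no decisions like the above is what makes the reward signal behave like stable classification rather than free-form scoring.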