[2602.22697] Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Summary
The paper presents InteractCS-RL, a framework that uses reinforcement learning to balance empathetic communication against budget-aware decision-making in task-oriented dialogue systems.
Why It Matters
As AI-driven dialogue systems evolve, addressing the dual challenge of user satisfaction and operational cost is crucial for real-world applications. This research offers a structured approach to optimize these systems, potentially improving user experience and business efficiency.
Key Takeaways
- Introduces InteractCS-RL, a framework for task-oriented dialogue.
- Balances user engagement with cost management using reinforcement learning.
- Demonstrates significant performance improvements over existing methods.
Computer Science > Computation and Language
arXiv:2602.22697 (cs) [Submitted on 26 Feb 2026]

Title: Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Authors: Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang, Yujie Wang, Wei He, Jinpeng Wang, Chaozheng Wang

Abstract: The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized re...
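The abstract mentions a PID-Lagrangian cost controller for keeping the policy within a global cost budget. The paper does not provide implementation details, but the general technique (known from constrained RL) adapts a Lagrange multiplier with proportional, integral, and derivative terms on the constraint violation. The sketch below is illustrative only; the class name, gain values, and update rule are assumptions, not the authors' code.

```python
class PIDLagrangianController:
    """Hypothetical sketch of a PID-Lagrangian cost controller.

    Tracks how far the observed episode cost exceeds a budget and
    produces a non-negative Lagrange multiplier that can weight a
    cost penalty in the RL objective (reward - lam * cost).
    Gains kp/ki/kd are illustrative defaults, not from the paper.
    """

    def __init__(self, cost_limit: float, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.05):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # accumulated constraint violation
        self.prev_error = 0.0    # last violation, for the D term

    def update(self, episode_cost: float) -> float:
        # Constraint violation: positive when over budget.
        error = episode_cost - self.cost_limit
        # Integral term is clamped at zero so long periods under
        # budget do not build up a negative reserve.
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        lam = (self.kp * error
               + self.ki * self.integral
               + self.kd * derivative)
        # The Lagrange multiplier must stay non-negative.
        return max(0.0, lam)


# Usage: the multiplier rises when cost exceeds the budget and
# falls back to zero once the policy is within budget again.
ctrl = PIDLagrangianController(cost_limit=1.0)
lam_over = ctrl.update(episode_cost=2.0)   # over budget -> lam > 0
lam_under = ctrl.update(episode_cost=0.5)  # under budget -> lam == 0
```

In a training loop, the returned multiplier would scale the cost term subtracted from the policy-gradient reward, so the optimizer is pushed back toward the feasible region whenever the cost constraint is violated.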