[2602.15854] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Summary
This paper presents Goal-Oriented Preference Optimization (GOPO), a new framework for enhancing task-oriented dialogue systems by decoupling strategy planning from response generation, leading to improved performance in e-commerce applications.
Why It Matters
Existing training methods for task-oriented dialogue systems, such as token-level likelihood and preference optimization, align poorly with long-horizon task success. GOPO targets this gap by optimizing goal preferences at the level of the whole dialogue trajectory, which makes it relevant to both academic research and practical applications such as e-commerce customer service.
Key Takeaways
- GOPO decouples strategy from execution in dialogue systems, improving task success rates.
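- The framework uses hierarchical reinforcement learning with two agents: an Expert Agent for strategy planning and a Customer Service Agent for response generation.
- On the Mgshop dataset, GOPO improves the Task-focused Sequential Engagement (TSE) metric by 7.7% and 10.3% over PPO and Memento.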
- Ablation studies highlight the importance of the Expert Agent in optimizing long-term goals.
- The research establishes a new paradigm for commercial task-oriented dialogue systems.
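The decoupling described in the takeaways above can be illustrated with a minimal sketch. This is not the authors' implementation: the strategy labels, the score table standing in for a learned trajectory-level preference model, the template lookup standing in for an LLM, and the `run_turn` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical strategy labels; the paper's actual strategy space is not stated here.
STRATEGIES = ["clarify_need", "recommend_product", "close_sale"]


@dataclass
class ExpertAgent:
    """Planning only: picks a high-level strategy for the next turn."""

    # Stand-in for a learned, trajectory-level preference model.
    # The scores are made-up illustration values.
    scores: Dict[str, float]

    def select_strategy(self, history: List[str]) -> str:
        # Toy rule: open with a clarifying question, then follow the top score.
        if not history:
            return "clarify_need"
        return max(STRATEGIES, key=lambda s: self.scores.get(s, 0.0))


@dataclass
class CustomerServiceAgent:
    """Execution only: produces a response constrained by the chosen strategy."""

    templates: Dict[str, str]

    def respond(self, strategy: str, user_utterance: str) -> str:
        # A real system would prompt an LLM with the strategy;
        # a template lookup stands in for that call here.
        return self.templates[strategy].format(utterance=user_utterance)


def run_turn(
    expert: ExpertAgent,
    agent: CustomerServiceAgent,
    history: List[str],
    user_utterance: str,
) -> Tuple[str, str]:
    """One dialogue turn: plan first, then generate, then record both utterances."""
    strategy = expert.select_strategy(history)
    reply = agent.respond(strategy, user_utterance)
    history.extend([user_utterance, reply])
    return strategy, reply
```

Keeping the two roles behind separate interfaces means the planner could be trained against dialogue-level outcomes while the generator is only asked to follow the selected strategy, which is the separation the summary describes.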
Computer Science > Computation and Language
arXiv:2602.15854 (cs)
[Submitted on 24 Jan 2026]
Title: Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Authors: Jingyi Xu, Xingyu Ren, Zhiqiang You, Yumeng Zhang, Zhoupeng Shou
Abstract: Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model t...