[2510.14967] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
Computer Science > Computation and Language
arXiv:2510.14967 (cs)
[Submitted on 16 Oct 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
Authors: Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying

Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are provided only upon generating the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate three critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals; (ii) lack of fine-grained credit assignment, where the correctness of intermediate turns is obscured, especially in long-horizon tasks; and (iii) poor sample efficiency, where each rollout yields only a single outcome signal, leading to low data utilization. In this p...
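The "advantage collapse" issue described in the abstract can be made concrete with a minimal sketch. Assuming a GRPO-style setup where advantages are computed by mean-centering and std-normalizing outcome rewards within a rollout group (the group size and `eps` constant below are illustrative assumptions, not details from the paper), a group whose rollouts all receive the same reward yields all-zero advantages and hence no gradient signal:

```python
# Sketch of group-normalized (GRPO-style) advantages to illustrate
# "advantage collapse": if every rollout in a group receives the same
# outcome reward, the normalized advantages are all zero.

def group_advantages(rewards, eps=1e-8):
    """Mean-center and std-normalize rewards within one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes: nonzero advantages, so there is a learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Advantage collapse: all rollouts fail (or all succeed) -> all-zero
# advantages, and the policy update carries no information.
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed, collapsed)
```

This is why the abstract emphasizes per-turn signals: denser rewards make identical-reward groups rarer and restore within-group contrast.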