[2604.03098] Co-Evolution of Policy and Internal Reward for Language Agents
Computer Science > Machine Learning
arXiv:2604.03098 (cs)
[Submitted on 3 Apr 2026]

Title: Co-Evolution of Policy and Internal Reward for Language Agents
Authors: Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu

Abstract: Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often decouple reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. Across three agent benchmarks, inference-time self-guidance alone yields clear gains, while jointly evolving policy an...
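The abstract describes a two-part mechanism: a self-generated guidance signal steers actions at inference time, and the same signal is reused as a dense, step-level reward during training. The toy sketch below illustrates that structure only; the environment, the `generate_guidance`/`guidance_to_reward` functions, and the scoring rule are all hypothetical stand-ins, not the paper's actual prompting or optimization procedure.

```python
# Hypothetical sketch of the co-evolving loop from the abstract.
# Toy environment: an integer state the agent should drive toward 0.

def generate_guidance(policy, state):
    """The agent emits a short self-guidance signal for the current state
    (stand-in for an LLM-generated hint)."""
    return f"prefer the action that shrinks |state| (currently {abs(state)})"

def act(policy, state, guidance):
    """Inference-time use: the guidance steers the next action (toy rule)."""
    return -1 if state > 0 else 1

def guidance_to_reward(state, next_state):
    """Training-time use: convert the same signal into a step-level internal
    reward (here: measured progress toward 0, a stand-in for real scoring)."""
    return abs(state) - abs(next_state)

def rollout(policy, state=5, horizon=8):
    """Collect a trajectory where every step carries an internal reward,
    instead of a single sparse reward at the end of the episode."""
    trajectory = []
    for _ in range(horizon):
        guidance = generate_guidance(policy, state)
        action = act(policy, state, guidance)
        next_state = state + action
        reward = guidance_to_reward(state, next_state)  # dense, per step
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

traj = rollout(policy=None)
dense_rewards = [r for _, _, r in traj]
print(dense_rewards)  # one reward per step, not only a terminal one
```

In a real training loop these per-step rewards would feed a policy-gradient update, closing the loop the abstract describes: the updated policy then generates better guidance on the next round of rollouts.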