[2603.21563] Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Computer Science > Artificial Intelligence
arXiv:2603.21563 (cs)
[Submitted on 23 Mar 2026]

Title: Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Authors: Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang

Abstract: Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think--Reason dyad and multi-agent voting. Across mathematical and logical reas...
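The two core ideas in the abstract, counterfactual per-agent advantages and global-history-aware normalization, can be sketched as follows. This is a minimal illustration under assumed definitions, not the paper's implementation: `counterfactual_advantages` and `GlobalHistoryNormalizer` are hypothetical names, and the counterfactual rewards are taken as given rather than simulated by an actual rollout.

```python
import statistics

def counterfactual_advantages(full_reward, counterfactual_rewards):
    """Per-agent advantage: the shared global reward minus the reward of a
    counterfactual rollout with that agent's contribution removed.
    A larger gap indicates a larger marginal contribution."""
    return {agent: full_reward - cf_reward
            for agent, cf_reward in counterfactual_rewards.items()}

class GlobalHistoryNormalizer:
    """Calibrate advantages with running statistics accumulated over all
    past rollouts (a stand-in for the paper's normalization scheme)."""
    def __init__(self):
        self.history = []  # advantages seen across all rollouts so far

    def normalize(self, advantages):
        self.history.extend(advantages.values())
        mu = statistics.fmean(self.history)
        sigma = statistics.pstdev(self.history) or 1.0  # guard against zero spread
        return {agent: (a - mu) / sigma for agent, a in advantages.items()}

# Example: a Think-Reason dyad where the full trajectory earns reward 1.0,
# removing "think" drops it to 0.4, and removing "reason" drops it to 0.9.
adv = counterfactual_advantages(1.0, {"think": 0.4, "reason": 0.9})
# "think" receives the larger advantage (0.6 vs 0.1), reflecting its
# larger marginal contribution to the shared outcome.
norm = GlobalHistoryNormalizer().normalize(adv)
```

Normalizing against global rollout statistics, rather than per-rollout statistics, keeps the advantage scale comparable across heterogeneous tasks whose raw rewards differ in magnitude.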