[2602.22786] QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning
Summary
The paper introduces QSIM, a framework that mitigates Q-value overestimation in multi-agent reinforcement learning (MARL) by weighting temporal-difference targets with action similarity, improving learning stability and performance.
Why It Matters
Q-value overestimation is a significant challenge in MARL, leading to unstable learning and suboptimal policies. QSIM's approach to mitigating this issue is crucial for advancing the effectiveness of collaborative AI systems, making it relevant for researchers and practitioners in AI and machine learning.
Key Takeaways
- QSIM mitigates Q-value overestimation in MARL through action similarity.
- The framework enhances learning stability by smoothing TD targets with behaviorally related actions.
- QSIM can be integrated with existing value decomposition methods for improved performance.
- Empirical results show significant reductions in systematic value overestimation.
- The proposed method is applicable across various MARL algorithms.
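The takeaways above can be illustrated with a minimal sketch of a similarity-weighted TD target. This is an assumption-laden toy, not the paper's implementation: the function names (`hamming_similarity`, `qsim_td_target`), the Hamming-based similarity over joint actions, and the softmax weighting are all hypothetical choices consistent with the abstract's description.

```python
import numpy as np

def hamming_similarity(joint_a, joint_b):
    """Fraction of agents choosing the same individual action (assumed metric)."""
    joint_a, joint_b = np.asarray(joint_a), np.asarray(joint_b)
    return float(np.mean(joint_a == joint_b))

def qsim_td_target(reward, gamma, q_next, candidate_actions, temperature=1.0):
    """Smooth the TD target over near-greedy joint actions.

    q_next: dict mapping joint-action tuples to Q(s', a) estimates.
    candidate_actions: near-greedy joint actions (including the greedy one).
    """
    # Greedy joint action under the current Q-estimates.
    greedy = max(candidate_actions, key=lambda a: q_next[a])
    # Weight each candidate by its similarity to the greedy choice.
    sims = np.array([hamming_similarity(a, greedy) for a in candidate_actions])
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    # Similarity-weighted expectation replaces the bare max in the target.
    expected_q = sum(w * q_next[a] for w, a in zip(weights, candidate_actions))
    return reward + gamma * expected_q

# Toy usage: two agents, three candidate joint actions.
q_next = {(0, 0): 1.0, (0, 1): 0.8, (1, 1): 0.5}
target = qsim_td_target(reward=0.1, gamma=0.99, q_next=q_next,
                        candidate_actions=list(q_next))
```

Because the target averages over behaviorally related actions rather than taking the greedy maximum, it sits strictly below the pure max-based target, which is the overestimation-dampening effect the takeaways describe.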
Computer Science > Multiagent Systems
arXiv:2602.22786 (cs)
[Submitted on 26 Feb 2026]
Title: QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning
Authors: Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu
Abstract: Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigate...
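A plausible form of the similarity-weighted target described in the abstract, written out as math. The notation (near-greedy set $\mathcal{A}_{\text{near}}$, similarity function $\mathrm{sim}$, temperature $\tau$) is assumed for illustration and is not taken from the paper:

$$
y = r + \gamma \sum_{a \in \mathcal{A}_{\text{near}}} w(a)\, Q_{\text{tot}}(s', a),
\qquad
w(a) = \frac{\exp\!\big(\mathrm{sim}(a, a^{*}) / \tau\big)}{\sum_{a' \in \mathcal{A}_{\text{near}}} \exp\!\big(\mathrm{sim}(a', a^{*}) / \tau\big)},
$$

where $a^{*} = \arg\max_{a} Q_{\text{tot}}(s', a)$ is the greedy joint action. Compared with the standard target $y = r + \gamma \max_{a} Q_{\text{tot}}(s', a)$, the weighted expectation never exceeds the max, so noise-driven overestimates of individual joint actions are dampened.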