[2501.18138] B3C: A Minimalist Approach to Offline Multi-Agent Reinforcement Learning
Summary
The paper presents B3C, a minimalist approach to offline multi-agent reinforcement learning that mitigates value overestimation by combining behavior cloning (BC) regularization with clipping of the target critic value.
Why It Matters
This research tackles a critical challenge in offline reinforcement learning: overestimation of unseen actions, which is amplified in multi-agent environments where the joint action space makes traditional single-agent remedies struggle. By improving performance with a simple, minimalist method, B3C contributes to AI systems that rely on collaborative decision-making.
Key Takeaways
- B3C combines behavior cloning regularization with critic clipping to enhance policy evaluation.
- The method effectively mitigates overestimation issues prevalent in multi-agent settings.
- B3C outperforms existing state-of-the-art algorithms in offline multi-agent benchmarks.
- Non-linear value factorization techniques are leveraged for improved performance.
- The approach is a minimalist adaptation of successful single-agent strategies to multi-agent contexts.
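The combination described in the takeaways above can be sketched as a TD3+BC-style actor objective. This is a hypothetical sketch: the function name, `rl_weight`, and the squared-error BC term are illustrative assumptions, not the paper's exact formulation.

```python
def bc_regularized_actor_loss(q_value, policy_action, dataset_action, rl_weight):
    """Actor loss trading off the RL objective against behavior cloning.

    q_value: critic estimate Q(s, pi(s)) for the policy's action.
    rl_weight: weight on the RL term relative to the BC term; B3C's
    critic clipping is what allows this weight to be pushed higher
    without the critic diverging.
    """
    # BC term: squared error between the policy's action and the
    # action stored in the offline dataset.
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_action, dataset_action))
    # Maximize Q (i.e., minimize -Q) while staying close to the data.
    return -rl_weight * q_value + bc_term
```

With `rl_weight = 0` this collapses to pure behavior cloning; larger weights lean harder on the (clipped) critic.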
Computer Science > Machine Learning
arXiv:2501.18138 (cs)
Submitted on 30 Jan 2025 (v1), last revised 12 Feb 2026 (this version, v3)
Title: B3C: A Minimalist Approach to Offline Multi-Agent Reinforcement Learning
Authors: Woojun Kim, Katia Sycara
Abstract: Overestimation arising from selecting unseen actions during policy evaluation is a major challenge in offline reinforcement learning (RL). A minimalist approach in the single-agent setting -- adding behavior cloning (BC) regularization to existing online RL algorithms -- has been shown to be effective; however, this approach is understudied in multi-agent settings. In particular, overestimation becomes worse in multi-agent settings due to the presence of multiple actions, resulting in the BC-regularization-based approach easily suffering from either over-regularization or critic divergence. To address this, we propose a simple yet effective method, Behavior Cloning regularization with Critic Clipping (B3C), which clips the target critic value in policy evaluation based on the maximum return in the dataset and pushes the limit of the weight on the RL objective over BC regularization, thereby improving performance. Additionally, we leverage existing value factorization techniques, particularly non-linear factorization, which is understudied in offline setting...
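The clipping step described in the abstract can be sketched as follows. This is a minimal sketch under the assumption that the clip caps the bootstrapped TD target at the maximum return observed in the dataset; the exact placement of the clip in B3C may differ from this reading.

```python
def clipped_td_target(reward, next_q, max_dataset_return, gamma=0.99, done=False):
    """TD target capped at the best return seen in the offline dataset.

    Bounding the target at max_dataset_return (an assumed interpretation
    of B3C's critic clipping) stops the bootstrapped critic value from
    growing past anything the behavior data supports, which is the
    overestimation failure mode the method targets.
    """
    target = reward + (0.0 if done else gamma * next_q)
    return min(target, max_dataset_return)
```

When `next_q` is well behaved the clip is inactive and the update is the standard Bellman backup; it only bites when the critic starts to overestimate.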