[2602.20804] Probing Dec-POMDP Reasoning in Cooperative MARL
Summary
This paper audits whether popular cooperative multi-agent reinforcement learning (MARL) benchmarks actually demand Dec-POMDP reasoning, finding that many can be solved by simpler strategies than the partial-observability and coordination challenges they are assumed to test.
Why It Matters
Understanding the limitations of current benchmarks in cooperative MARL is crucial for advancing research in multi-agent systems. The findings suggest that existing benchmarks may lead to over-optimistic evaluations of agent capability, since high scores can be achieved without the reasoning the benchmarks are meant to measure.
Key Takeaways
- Dec-POMDP reasoning is essential for effective coordination in MARL.
- Many popular benchmarks do not require genuine Dec-POMDP reasoning for success.
- Reactive (memoryless) policies match the performance of memory-based agents in over half of the scenarios tested.
- Emergent coordination is often fragile, relying on synchronous action coupling rather than robust temporal influence.
- The authors provide diagnostic tools to improve benchmark design and evaluation.
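To make the reactive-versus-memory distinction concrete, here is a minimal toy sketch (not from the paper; the task and policy names are invented for illustration). A cue is shown only at the first step of an episode, so a policy that conditions only on the current observation can do no better than chance, while a policy with access to its observation history solves the task perfectly:

```python
import random

def run_episode(policy, seed=None):
    """One episode of a toy partially observable task: a binary cue is
    shown only at step 0; at the final step the agent must pick the
    action matching the cue. The intermediate observation is blank (-1)."""
    rng = random.Random(seed)
    cue = rng.randint(0, 1)
    history = [cue]                # step 0: cue is visible
    obs = -1                       # step 1: corridor, cue hidden
    history.append(obs)
    action = policy(obs, history)  # step 2: decide
    return 1.0 if action == cue else 0.0

def reactive_policy(obs, history):
    # Conditions only on the current (uninformative) observation.
    return random.randint(0, 1)

def memory_policy(obs, history):
    # Recalls the cue from the start of its observation history.
    return history[0]

n = 2000
reactive_score = sum(run_episode(reactive_policy, seed=i) for i in range(n)) / n
memory_score = sum(run_episode(memory_policy, seed=i) for i in range(n)) / n
print(f"reactive ~ {reactive_score:.2f}, memory = {memory_score:.2f}")
```

The paper's point is that on many existing benchmarks this gap never appears: the reactive baseline already matches the memory-based one, suggesting the hidden state is either irrelevant or recoverable from the current observation.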
Computer Science > Machine Learning — arXiv:2602.20804 (cs)
[Submitted on 24 Feb 2026]
Title: Probing Dec-POMDP Reasoning in Cooperative MARL
Authors: Kale-ab Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey
Abstract: Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings s...
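The abstract mentions information-theoretic probes. One simple probe in this spirit (an illustrative sketch, not the paper's actual method) compares how much a policy's action depends on the current observation versus the full history, using a plug-in mutual-information estimate over logged trajectories. The trajectory data below is hypothetical:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += p * math.log2(p * n * n / (px[x] * py[y]))
    return mi

# Hypothetical logged transitions: (current observation, summarised
# history, chosen action). Here the action tracks history, not obs.
trajectories = [
    (0, "cue0", 0), (0, "cue1", 1),
    (0, "cue0", 0), (0, "cue1", 1),
]
obs_action = [(o, a) for o, h, a in trajectories]
hist_action = [(h, a) for o, h, a in trajectories]
print(mutual_information(obs_action))   # ~0 bits: obs alone is uninformative
print(mutual_information(hist_action))  # ~1 bit: action is determined by history
```

A policy whose actions carry high mutual information with history but not with the current observation is exploiting memory; when both quantities coincide, a reactive policy would suffice, which is the pattern the paper reports on many benchmarks.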