[2602.08104] Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems
Summary
This paper presents a novel framework for interpretable failure analysis in Multi-Agent Reinforcement Learning (MARL) systems, focusing on diagnosing cascading failures in safety-critical applications.
Why It Matters
As MARL systems are increasingly used in critical domains, understanding and interpreting failures is essential for ensuring safety and reliability. This framework enhances transparency in failure detection, which is crucial for trust in AI systems.
Key Takeaways
- Introduces a two-stage, gradient-based framework for interpretable failure analysis in MARL.
- Reports 88.2-99.4% accuracy in detecting the initial failure source (Patient-0).
- Uses geometric analysis of critic derivatives to explain failure propagation and "downstream-first" detection anomalies.
- Moves beyond traditional black-box approaches to provide actionable diagnostics.
- Enhances safety in critical applications by improving failure transparency.
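To make the detection idea concrete, here is a minimal toy sketch of "flag the first agent whose Taylor remainder crosses a threshold". It is not the authors' implementation. The paper applies Taylor-remainder analysis to policy-gradient costs; this sketch substitutes a hypothetical quadratic cost with an analytic gradient. Every name in it (`cost`, `cost_grad`, `taylor_remainder`, `first_crossing`, `demo`), the threshold of 0.5, and the simulated perturbation are assumptions made for illustration only.

```python
import numpy as np


def cost(x):
    # Hypothetical stand-in for a per-agent cost: 0.5 * ||x||^2.
    return 0.5 * float(x @ x)


def cost_grad(x):
    # Analytic gradient of the toy cost above.
    return x


def taylor_remainder(f, grad_f, x, x_new):
    # Absolute first-order Taylor remainder:
    # |f(x_new) - f(x) - grad_f(x) . (x_new - x)|
    d = x_new - x
    return abs(f(x_new) - f(x) - float(grad_f(x) @ d))


def first_crossing(scores, threshold):
    # scores has shape (n_steps, n_agents).
    # Returns (agent, step) for the earliest threshold crossing, or None.
    hits = scores > threshold
    if not hits.any():
        return None
    step = int(np.argmax(hits.any(axis=1)))  # first step with any crossing
    agent = int(np.argmax(scores[step]))     # largest score at that step
    return agent, step


def demo():
    n_steps, n_agents = 5, 3
    x = np.zeros((n_agents, 2))
    scores = np.zeros((n_steps, n_agents))
    for t in range(n_steps):
        move = np.full((n_agents, 2), 0.1)   # small nominal change per step
        if t == 3:
            move[1] = 1.0                    # simulated perturbation of agent 1
        x_new = x + move
        for i in range(n_agents):
            scores[t, i] = taylor_remainder(cost, cost_grad, x[i], x_new[i])
        x = x_new
    return scores, first_crossing(scores, threshold=0.5)


if __name__ == "__main__":
    _, candidate = demo()
    print("Patient-0 candidate (agent, step):", candidate)
```

In this toy setting the remainder stays small while each agent's input changes smoothly and jumps when one agent is perturbed, so the earliest crossing points at that agent. As the abstract notes, the first crossing in a real system can come from a downstream agent, which is why the paper adds a second validation stage.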
arXiv:2602.08104 (cs), Computer Science > Artificial Intelligence
Submitted on 8 Feb 2026 (v1), last revised 22 Feb 2026 (this version, v2)
Authors: Risal Shahriar Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani
Abstract
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives (first-order sensitivity and directional second-order curvature aggregated over causal windows) to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by reve...