[2602.08104] Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

arXiv - Machine Learning · 4 min read · Article

Summary

This paper presents a novel framework for interpretable failure analysis in Multi-Agent Reinforcement Learning (MARL) systems, focusing on diagnosing cascading failures in safety-critical applications.

Why It Matters

As MARL systems are increasingly used in critical domains, understanding and interpreting failures is essential for ensuring safety and reliability. This framework enhances transparency in failure detection, which is crucial for trust in AI systems.

Key Takeaways

  • Introduces a two-stage framework for interpretable failure analysis in MARL.
  • Achieves high accuracy (88.2-99.4%) in detecting initial failure sources.
  • Utilizes geometric analysis to explain failure propagation and detection anomalies.
  • Moves beyond traditional black-box approaches to provide actionable diagnostics.
  • Enhances safety in critical applications by improving failure transparency.

Abstract

arXiv:2602.08104 (cs) · Computer Science > Artificial Intelligence
Submitted on 8 Feb 2026 (v1), last revised 22 Feb 2026 (this version, v2)
Authors: Risal Shahriar Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani

Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives: first-order sensitivity and directional second-order curvature aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by reve...
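The abstract describes Stage 1 only at a high level, so the following is a minimal, hypothetical Python sketch of what per-agent Taylor-remainder detection with a first-threshold-crossing Patient-0 rule could look like. The function names, the assumption that per-agent policy-gradient costs, gradients, and parameter snapshots are already available, and the single scalar threshold are illustrative choices, not the paper's actual specification.

```python
import numpy as np

def taylor_remainder_scores(costs, grads, params):
    """Per-agent anomaly scores from a first-order Taylor remainder (illustrative).

    costs[i][t]  : scalar policy-gradient cost for agent i at step t (assumed given)
    grads[i][t]  : gradient of that cost w.r.t. agent i's parameters at step t
    params[i][t] : agent i's parameter vector at step t

    Score at step t: |J(theta_{t+1}) - J(theta_t) - grad_t . (theta_{t+1} - theta_t)|,
    i.e. how poorly a first-order expansion explains the observed cost change.
    """
    scores = []
    for i in range(len(costs)):
        agent_scores = []
        for t in range(len(costs[i]) - 1):
            delta = params[i][t + 1] - params[i][t]
            predicted = costs[i][t] + grads[i][t] @ delta          # first-order prediction
            agent_scores.append(abs(costs[i][t + 1] - predicted))  # Taylor remainder
        scores.append(np.asarray(agent_scores))
    return scores

def patient_zero_candidate(scores, threshold):
    """Flag the agent whose remainder score crosses the threshold earliest."""
    first_crossing = {}
    for i, s in enumerate(scores):
        hits = np.nonzero(s > threshold)[0]
        if hits.size:
            first_crossing[i] = int(hits[0])
    if not first_crossing:
        return None                                      # no failure detected in the trace
    return min(first_crossing, key=first_crossing.get)   # earliest crossing wins
```

The appeal of such a score is that each detection comes with a readable justification: the observed cost change deviated from its own linear prediction by more than the threshold at a specific step.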

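For Stage 2, a similarly hedged sketch of the idea of combining first-order critic sensitivities with directional second-order curvature over a causal window to build a contagion graph between agents. The precomputed sensitivity and curvature tensors, the 0.5 curvature weighting, and the exposure-ranking helper are assumptions made for illustration; the paper's actual aggregation and graph construction may differ.

```python
import itertools
import numpy as np

def contagion_graph(sensitivity, curvature, window, curvature_weight=0.5):
    """Aggregate cross-agent critic derivatives into a contagion graph (illustrative).

    sensitivity[t][i, j] : first-order sensitivity of agent i's critic to agent j's
                           inputs at step t (assumed precomputed, e.g. a gradient norm)
    curvature[t][i, j]   : directional second-order term d^T H d along the same
                           perturbation direction (also assumed precomputed)
    window               : (start, end) causal window to aggregate over

    Entry (j, i) of the result suggests how strongly a perturbation of agent j
    propagates to agent i within the window.
    """
    start, end = window
    n = sensitivity[start].shape[0]
    adj = np.zeros((n, n))
    for t in range(start, end):
        for i, j in itertools.product(range(n), repeat=2):
            if i == j:
                continue
            # Combine slope and curvature; the weighting is an illustrative choice.
            adj[j, i] += abs(sensitivity[t][i, j]) + curvature_weight * abs(curvature[t][i, j])
    return adj / max(end - start, 1)

def propagation_order(adj, patient_zero):
    """Rank the other agents by inferred exposure to the Patient-0 candidate."""
    exposure = adj[patient_zero]            # outgoing edge weights from Patient-0
    return [int(a) for a in np.argsort(-exposure) if a != patient_zero]
```

A graph built this way would make "downstream-first" anomalies explainable in principle: an agent other than Patient-0 can trip the Stage-1 threshold first if its incoming edge weights from Patient-0 are large within the causal window.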
