[2602.16543] Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Summary
This paper presents a framework for analyzing the vulnerabilities of Safe Reinforcement Learning (Safe RL) policies against adversarial attacks, highlighting that existing gradient-based attacks are impractical in real-world scenarios because they require access to the victim policy's internals.
Why It Matters
Understanding vulnerabilities in Safe RL is crucial as these systems are increasingly deployed in safety-critical applications. By exposing adversarial threats, the proposed framework informs the design of more robust RL policies, contributing to the development of safer AI systems.
Key Takeaways
- Safe RL methods often assume benign environments, leaving them vulnerable to adversarial perturbations.
- The proposed framework enables adversarial attacks without needing internal gradients.
- Theoretical analysis provides perturbation bounds for safer policy design.
- Experiments demonstrate the effectiveness of the approach under limited (black-box) access to the victim policy.
- This research highlights the need for robust RL policies in real-world applications.
Computer Science > Machine Learning
arXiv:2602.16543 (cs)
[Submitted on 18 Feb 2026]
Title: Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Authors: Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong
Abstract: Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach un...
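The core idea above — taking attack gradients from a learned surrogate policy rather than the inaccessible victim — can be sketched in a few lines. This is a minimal illustration, not the paper's method: the linear surrogate `W`, the budget `eps`, and the "unsafe" direction `c` are all assumptions made for the example.

```python
import numpy as np

# Toy stand-in for an imitation-learned surrogate policy: a = W @ s.
# In the paper's setting, the surrogate (and a constraint model) would be
# learned from expert demonstrations and black-box environment interaction.
rng = np.random.default_rng(0)
obs_dim, act_dim = 4, 2
W = rng.normal(size=(act_dim, obs_dim))

def surrogate_action(s):
    return W @ s

# Attack objective (hypothetical): push the surrogate's action along an
# unsafe direction c, i.e. J(delta) = c^T W (s + delta).
# Its gradient w.r.t. delta is simply W^T c — no victim gradients needed.
c = np.array([1.0, 0.0])
grad = W.T @ c

# One FGSM-style step, kept inside an l_inf perturbation budget eps,
# echoing the paper's perturbation-bound analysis.
eps = 0.05
s_clean = rng.normal(size=obs_dim)
delta = eps * np.sign(grad)
s_adv = s_clean + delta

print(np.max(np.abs(delta)) <= eps)                                  # budget respected
print(c @ surrogate_action(s_adv) > c @ surrogate_action(s_clean))   # objective increased
```

Because the objective is linear in the observation, the signed-gradient step provably increases it by `eps * ||W^T c||_1`; against a real victim policy the surrogate's gradient is only an approximation, which is exactly the gap the paper's feasibility analysis addresses.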