[2602.10917] Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins
Computer Science > Machine Learning
arXiv:2602.10917 (cs)
[Submitted on 11 Feb 2026 (v1), last revised 3 Mar 2026 (this version, v2)]

Title: Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins
Authors: Qian Zuo, Zhiyong Wang, Fengxiang He

Abstract: We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization...
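To make the two ingredients named in the abstract concrete, here is a hedged sketch (not the paper's FlexDOME algorithm, whose details are not given in this abstract): a generic primal-dual update on a one-dimensional toy problem, combining a time-decaying safety margin eps_t with a primal regularization term. All constants (eta, tau, b, the t^{-1/2} margin schedule) are hypothetical choices for illustration only.

```python
import numpy as np

def primal_dual_with_margin(T=2000, eta=0.1, tau=0.2, b=0.5):
    """Maximize r(x) = x subject to c(x) = x <= b, for x in [0, 1].

    The dual ascent step uses the margin-tightened constraint
    c(x) <= b - eps_t, where eps_t = t^{-1/2} decays to zero, so early
    iterates stay strictly feasible while the asymptotic solution is not
    over-conservative. The small primal regularizer damps the oscillations
    that plain primal-dual dynamics exhibit, so the last iterate settles.
    """
    x, lam = 0.0, 0.0                      # primal iterate, dual variable
    for t in range(1, T + 1):
        eps_t = t ** -0.5                  # decaying safety margin
        # Regularized Lagrangian L = x - (tau/2) x^2 - lam (x - b + eps_t):
        # gradient ascent in x, then dual ascent on the tightened violation.
        x = float(np.clip(x + eta * (1.0 - tau * x - lam), 0.0, 1.0))
        lam = max(0.0, lam + eta * (x - (b - eps_t)))
    return x, lam

x_T, lam_T = primal_dual_with_margin()
# The last iterate approaches the tightened boundary b - eps_T from below.
```

Without the regularizer, gradient descent-ascent on this bilinear Lagrangian only oscillates around the saddle point, which mirrors the abstract's point that unregularized primal-dual methods are restricted to average-iterate convergence.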