[2603.23889] Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
Computer Science > Machine Learning
arXiv:2603.23889 (cs)

[Submitted on 25 Mar 2026]

Title: Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
Authors: Guopeng Li, Matthijs T.J. Spaan, Julian F.P. Kooij

Abstract: When safety is formulated as a limit on cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint during both data collection and deployment. Off-policy safe RL methods, although highly sample efficient, suffer from constraint violations due to cost-agnostic exploration and estimation bias in the cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration with conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to bound the cost incurred during training. Second, we adopt truncated quantile critics to stabilize cost value learning; the quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q ach...
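The abstract names two mechanisms but gives no implementation details. As a rough illustration of the second one, here is a minimal NumPy sketch of how truncated quantile critics form a conservative bootstrap target in the style of TQC (Kuznetsov et al., 2020): quantile estimates from an ensemble of critics are pooled, sorted, and the most extreme atoms are discarded before averaging. The ensemble size, quantile count, and truncation count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def truncated_mixture_target(quantiles, n_drop):
    """TQC-style truncated-quantile bootstrap target.

    quantiles: array of shape (n_critics, n_quantiles), per-critic quantile
               estimates of the value distribution at one (state, action).
    n_drop:    number of the largest pooled atoms to discard.

    Dropping the right tail counters the overestimation bias of max-based
    bootstrapping for a value being maximized; for a cost critic the
    truncated tail may differ (the abstract does not specify).
    """
    pooled = np.sort(quantiles.reshape(-1))   # pool all atoms and sort
    kept = pooled[: pooled.size - n_drop]     # drop the n_drop largest
    return kept.mean()                        # scalar bootstrap target

# Toy usage: 2 critics x 5 quantiles, drop the 2 largest pooled atoms.
q = np.array([[0.1, 0.4, 0.6, 0.9, 1.5],
              [0.2, 0.3, 0.7, 1.0, 2.0]])
print(truncated_mixture_target(q, n_drop=2))  # averages the 8 smallest atoms
```

For the first mechanism, one standard way to resolve gradient conflicts between reward and cost in the action space is a PCGrad-style projection: if ascending the reward gradient would also raise the estimated cost (positive inner product with the cost gradient), remove the conflicting component, then clip the step to a trust-region radius. The sketch below is a stand-in under those assumptions, not necessarily the rule used in COX-Q; an adaptive radius (e.g. one that shrinks as the cost budget is consumed) would give the adjustment the abstract describes.

```python
import numpy as np

def exploration_perturbation(grad_reward, grad_cost, radius):
    """Action-space exploration step that, to first order, does not
    increase estimated cost, clipped to a trust region of given radius."""
    inner = grad_reward @ grad_cost
    if inner > 0.0:  # ascending reward would also raise estimated cost
        grad_reward = grad_reward - inner / (grad_cost @ grad_cost + 1e-8) * grad_cost
    norm = np.linalg.norm(grad_reward)
    if norm > radius:  # trust region: cap the perturbation magnitude
        grad_reward = grad_reward * (radius / norm)
    return grad_reward

# Toy usage: conflicting 2-D gradients, trust-region radius 0.5.
delta = exploration_perturbation(np.array([1.0, 0.5]),
                                 np.array([0.8, 0.0]), radius=0.5)
print(delta)  # cost-gradient component removed: [0.0, 0.5]
```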