[2603.23889] Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

arXiv - Machine Learning

Computer Science > Machine Learning
arXiv:2603.23889 (cs)
[Submitted on 25 Mar 2026]

Title: Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
Authors: Guopeng Li, Matthijs T.J. Spaan, Julian F.P. Kooij

Abstract: When safety is formulated as a limit on cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint during both data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in the cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration with conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize cost value learning; the quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q ach...
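The abstract's first sentence refers to the standard constrained-MDP objective. As a reference point, in common notation (assumed here rather than taken from the paper, which may use a finite horizon or a separate cost discount):

$$
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d,
$$

where $r$ is the per-step reward, $c$ the per-step cost, and $d$ the cost budget that defines the safety limit.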
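The abstract does not spell out how COX-Q "resolves gradient conflicts between reward and cost in the action space." As a rough illustration only, a PCGrad-style projection on action gradients could look like the sketch below; the function name and the projection rule are assumptions, not the paper's method.

```python
import numpy as np

def resolve_action_gradient(g_reward: np.ndarray, g_cost: np.ndarray) -> np.ndarray:
    """Hypothetical conflict resolution: if ascending the reward gradient
    would also push the action toward higher cost, remove the component
    of the reward gradient that lies along the cost gradient.

    g_reward: gradient of the reward value w.r.t. the action
    g_cost:   gradient of the cost value w.r.t. the action
    """
    overlap = float(np.dot(g_reward, g_cost))
    if overlap > 0.0:
        # Conflict: following g_reward raises cost too, so project it
        # onto the subspace orthogonal to g_cost.
        g_reward = g_reward - overlap / (np.dot(g_cost, g_cost) + 1e-8) * g_cost
    return g_reward

# Example: the reward direction partially overlaps the cost direction.
g_r = np.array([1.0, 0.5])
g_c = np.array([1.0, 0.0])
print(resolve_action_gradient(g_r, g_c))  # -> [0.  0.5]
```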
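Truncated quantile critics, by contrast, are an established technique (Kuznetsov et al., 2020): quantile atoms from an ensemble of distributional critics are pooled and the largest are dropped before averaging, which tames overestimation bias, while disagreement across critics doubles as an epistemic-uncertainty signal for exploration. Below is a minimal sketch of both ideas with illustrative names and parameters; how COX-Q applies the truncation to the cost critic specifically is not stated in the truncated abstract.

```python
import numpy as np

def truncated_quantile_target(quantiles: np.ndarray, drop_per_critic: int) -> float:
    """TQC-style value estimate: pool the quantile atoms from an ensemble
    of distributional critics, sort them, and drop the largest atoms
    before averaging to counteract overestimation.

    quantiles:       shape (n_critics, n_quantiles) of atom estimates
    drop_per_critic: number of top atoms removed per critic
    """
    pooled = np.sort(quantiles.reshape(-1))
    keep = pooled.size - drop_per_critic * quantiles.shape[0]
    return float(pooled[:keep].mean())

def ensemble_uncertainty(quantiles: np.ndarray) -> float:
    """Crude epistemic-uncertainty proxy: std of the critics' mean value
    estimates, usable as an optimistic exploration bonus."""
    return float(quantiles.mean(axis=1).std())

# Example: 2 critics x 5 atoms each; dropping 1 top atom per critic
# discards the outlier estimates (5.0 and 4.0) from the average.
q = np.array([[0.1, 0.2, 0.3, 0.4, 5.0],
              [0.1, 0.2, 0.3, 0.5, 4.0]])
print(truncated_quantile_target(q, drop_per_critic=1))  # ~0.26
print(ensemble_uncertainty(q))
```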

Originally published on March 26, 2026. Curated by AI News.
