[2603.00945] Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values

arXiv - Machine Learning

Mathematics > Optimization and Control, arXiv:2603.00945 (math)
[Submitted on 1 Mar 2026]

Title: Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values
Authors: Shengbo Wang, Nian Si

Abstract: We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or a robust dynamic programming principle. Under the weak communication assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into the expected-regret criterion. We then introduce a transient-value framework to evaluate the finite-time performance of robust optimal policies, proving that average-reward optimality alone can mask arbitrarily poor transients and de...
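The minimax representation described in the abstract says the robust value is the infimum, over kernels in the ambiguity set, of the classical optimal average reward under each fixed kernel. A minimal numerical sketch of that outer infimum for a tiny two-state, two-action MDP (the kernels, rewards, and finite two-kernel ambiguity set below are made up for illustration and are not from the paper):

```python
import numpy as np
from itertools import product

def gain(P_pi, r_pi):
    """Long-run average reward of a stationary policy: solve rho P = rho,
    sum(rho) = 1, then take rho . r (assumes the induced chain is unichain)."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    rho, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(rho @ r_pi)

def optimal_gain(P, r):
    """Classical optimal average reward for a fixed kernel P[s, a, s']:
    brute force over deterministic stationary policies (fine for tiny MDPs)."""
    n_s, n_a, _ = P.shape
    best = -np.inf
    for pi in product(range(n_a), repeat=n_s):
        P_pi = np.array([P[s, pi[s]] for s in range(n_s)])
        r_pi = np.array([r[s, pi[s]] for s in range(n_s)])
        best = max(best, gain(P_pi, r_pi))
    return best

# Two hypothetical kernels forming a finite (hence non-rectangular) ambiguity set.
P1 = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])
P2 = np.array([[[0.6, 0.4], [0.3, 0.7]],
               [[0.8, 0.2], [0.4, 0.6]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Minimax representation: robust value = inf over the ambiguity set
# of the classical optimal gain under each fixed kernel.
robust_value = min(optimal_gain(P, r) for P in (P1, P2))
print(robust_value)
```

Brute-forcing deterministic stationary policies suffices here because, for each fixed kernel, the classical weakly communicating average-reward problem admits a deterministic stationary optimal policy; the paper's contribution concerns the harder question of which (history-dependent) policies attain the infimum uniformly over the coupled ambiguity set.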

Originally published on March 03, 2026. Curated by AI News.

Related Articles

[2603.14841] Real-Time Driver Safety Scoring Through Inverse Crash Probability Modeling (arXiv - AI, Machine Learning)
[2603.17839] How do LLMs Compute Verbal Confidence (arXiv - AI, LLMs)
[2603.15970] 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models (arXiv - AI, LLMs)
[2603.09085] Not All News Is Equal: Topic- and Event-Conditional Sentiment from Finetuned LLMs for Aluminum Price Forecasting (arXiv - AI, LLMs)