[2603.23926] Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
Computer Science > Machine Learning
arXiv:2603.23926 (cs) [Submitted on 25 Mar 2026]

Title: Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
Authors: Guy Zamir, Matthew Zurek, Yudong Chen

Abstract: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}(\sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S, A$ are the state and action space sizes and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case, but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior ...
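The leading $\sqrt{SA\,\text{Var}}$ term in the bound is characteristic of Bernstein-style confidence bonuses, where the exploration bonus for each state-action pair scales with the empirical variance of next-state values rather than their worst-case range. The sketch below shows a generic bonus of this kind for a single state-action pair; it is a standard textbook construction for intuition only, not the paper's algorithm, and the function name, constants, and the next_values input are hypothetical.

    import numpy as np

    def bernstein_bonus(next_values, delta=0.05):
        # next_values: value estimates V(s') observed after taking the action;
        # their empirical variance stands in for the per-step transition
        # variance that the cumulative quantity Var aggregates.
        n = len(next_values)
        if n == 0:
            return np.inf  # unvisited pairs stay maximally optimistic
        log_term = np.log(2.0 / delta)
        var_hat = np.var(next_values)  # empirical next-state value variance
        # Bernstein shape: a sqrt(variance / n) leading term plus a 1/n
        # lower-order term, mirroring the sqrt(SA * Var) + lower-order
        # structure of the regret bound once summed over all visits.
        return np.sqrt(2.0 * var_hat * log_term / n) + 3.0 * log_term / n

    # Example: four hypothetical V(s') samples for one (s, a) pair.
    samples = np.array([0.9, 1.1, 1.0, 0.8])
    print(bernstein_bonus(samples))

In a deterministic MDP every next_values array is constant, so var_hat is zero and only the 1/n lower-order term survives, which is one way to see why variance-dependent bounds can yield nearly constant regret there.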