[2602.16274] Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
Summary
This paper presents the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, obtained without optimism or bonus terms, and analyzes how the MDP's suboptimality gap governs regret growth.
Why It Matters
Regret bounds for online Q-learning quantify how quickly an agent's performance approaches optimality, which is central to algorithm efficiency in reinforcement learning. By dispensing with optimism-based bonus terms and introducing a gap-robust exploration scheme, this work addresses limitations of existing analyses of classical Q-learning.
Key Takeaways
- Introduces a high-probability regret bound for online Q-learning.
- Analyzes the impact of suboptimality gaps on regret growth.
- Proposes a Smoothed ε_n-Greedy exploration scheme with a gap-robust regret bound of near-Õ(N^{9/10}).
- Develops a concentration bound for contractive Markovian stochastic approximation.
- Highlights the significance of mixing time in convergence analysis.
Computer Science > Machine Learning
arXiv:2602.16274 (cs)
[Submitted on 18 Feb 2026]
Authors: Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos
Abstract: We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $\epsilon_n$-Greedy exploration scheme that combines $\epsilon_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is ...
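To make the exploration scheme concrete, here is a minimal sketch of an action-selection rule that combines ε_n-greedy with Boltzmann (softmax) exploration, in the spirit of the Smoothed ε_n-Greedy scheme described in the abstract. The function name and the specific decay schedules for ε_n and the temperature τ_n are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def smoothed_eps_greedy(q_values, n, rng, eps0=1.0, tau0=1.0):
    """Illustrative mix of eps_n-greedy and Boltzmann exploration.

    With probability eps_n, pick an action uniformly at random;
    otherwise sample from a softmax over Q-values with temperature
    tau_n. Both schedules decay with the step count n (assumed forms).
    """
    eps_n = eps0 / np.sqrt(n + 1)      # assumed decaying exploration rate
    tau_n = tau0 / np.log(n + 2)       # assumed decaying temperature
    # Numerically stable softmax over Q-values.
    logits = (q_values - q_values.max()) / tau_n
    probs = np.exp(logits)
    probs /= probs.sum()
    if rng.random() < eps_n:
        return int(rng.integers(len(q_values)))   # uniform exploration
    return int(rng.choice(len(q_values), p=probs))  # Boltzmann draw

# Usage: as n grows, both eps_n and tau_n shrink, so selection
# concentrates on the greedy action.
rng = np.random.default_rng(0)
q = np.array([0.1, 2.0, 0.5])
action = smoothed_eps_greedy(q, n=1000, rng=rng)
```

The uniform ε_n-component guarantees a floor on exploration of every action regardless of the Q-value gaps, which is what makes this style of scheme robust when suboptimality gaps are small.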