[2602.16274] Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
Summary
This paper presents the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, obtained without optimism or bonus terms, and analyzes how the MDP's suboptimality gap governs regret growth.
Why It Matters
Regret bounds for online Q-learning quantify how quickly an agent's performance approaches optimality, which is central to algorithm efficiency in reinforcement learning. By dispensing with optimism-based bonus terms and introducing a gap-robust exploration scheme, this work addresses limitations of existing analyses of classical Q-learning.
Key Takeaways
- Introduces a high-probability regret bound for online Q-learning.
- Analyzes the impact of suboptimality gaps on regret growth.
- Proposes a Smoothed ε_n-Greedy exploration scheme with a gap-robust regret bound of near-Õ(N^{9/10}).
- Develops a concentration bound for contractive Markovian stochastic approximation.
- Highlights the significance of mixing time in convergence analysis.
Computer Science > Machine Learning
arXiv:2602.16274 (cs)
[Submitted on 18 Feb 2026]
Authors: Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos
Abstract: We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $\epsilon_n$-Greedy exploration scheme that combines $\epsilon_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is ...
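To make the exploration scheme concrete, here is a minimal sketch of an action-selection rule that combines ε_n-greedy with Boltzmann (softmax) exploration, in the spirit of the Smoothed ε_n-Greedy scheme described in the abstract. The function name and the specific decay schedules for ε_n and the temperature τ_n are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def smoothed_eps_greedy(q_values, n, rng, eps0=1.0, tau0=1.0):
    """Illustrative mix of eps_n-greedy and Boltzmann exploration.

    With probability eps_n, pick an action uniformly at random;
    otherwise sample from a softmax over Q-values with temperature
    tau_n. Both schedules decay with the step count n (assumed forms).
    """
    eps_n = eps0 / np.sqrt(n + 1)      # assumed decaying exploration rate
    tau_n = tau0 / np.log(n + 2)       # assumed decaying temperature
    # Numerically stable softmax over Q-values.
    logits = (q_values - q_values.max()) / tau_n
    probs = np.exp(logits)
    probs /= probs.sum()
    if rng.random() < eps_n:
        return int(rng.integers(len(q_values)))   # uniform exploration
    return int(rng.choice(len(q_values), p=probs))  # Boltzmann draw

# Usage: as n grows, both eps_n and tau_n shrink, so selection
# concentrates on the greedy action.
rng = np.random.default_rng(0)
q = np.array([0.1, 2.0, 0.5])
action = smoothed_eps_greedy(q, n=1000, rng=rng)
```

The uniform ε_n-component guarantees a floor on exploration of every action regardless of the Q-value gaps, which is what makes this style of scheme robust when suboptimality gaps are small.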