[2602.16274] Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

[2602.16274] Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

arXiv - Machine Learning 3 min read Article

Summary

This paper presents a high-probability regret bound for online Q-learning in infinite-horizon discounted Markov decision processes, analyzing its performance under varying suboptimality gaps.

Why It Matters

Understanding regret bounds in online Q-learning is crucial for improving algorithm efficiency in reinforcement learning. This research addresses limitations of existing methods and offers new exploration strategies, making it relevant for advancing machine learning applications.

Key Takeaways

  • Introduces a high-probability regret bound for online Q-learning.
  • Analyzes the impact of suboptimality gaps on regret growth.
  • Proposes a Smoothed ε_n-Greedy exploration scheme for improved performance.
  • Develops a concentration bound for contractive Markovian stochastic approximation.
  • Highlights the significance of mixing time in convergence analysis.

Computer Science > Machine Learning arXiv:2602.16274 (cs) [Submitted on 18 Feb 2026] Title:Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains Authors:Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos View a PDF of the paper titled Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains, by Rahul Singh and 3 other authors View PDF HTML (experimental) Abstract:We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $\epsilon_n$-Greedy exploration scheme that combines $\epsilon_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is ...

Related Articles

Machine Learning

[For Hire] Ex-Microsoft Senior Data Engineer | Databricks, Palantir Foundry, MLOps | $55/hr

submitted by /u/mcheetirala2510 [link] [comments]

Reddit - ML Jobs · 1 min ·
Meta AI app climbs to No. 5 on the App Store after Muse Spark launch | TechCrunch
Machine Learning

Meta AI app climbs to No. 5 on the App Store after Muse Spark launch | TechCrunch

The app was ranking No. 57 on the App Store just before Meta AI's new model launched. Now it's No. 5 — and rising.

TechCrunch - AI · 4 min ·
Machine Learning

Detecting mirrored selfie images: OCR the best way? [D]

I'm trying to catch backwards "selfie" images before passing them to our VLM text reader and/or face embedding extraction. Since models l...

Reddit - Machine Learning · 1 min ·
Llms

Google’s Gemini AI can answer your questions with 3D models and simulations

submitted by /u/tekz [link] [comments]

Reddit - Artificial Intelligence · 1 min ·
More in Machine Learning: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime