[2502.06051] Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
Summary
This paper gives a sharp analysis of offline policy learning for $f$-divergence-regularized contextual bandits, characterizing the sample complexity of the problem and the data coverage conditions under which it is achievable.
Why It Matters
Understanding the sample complexity of offline reinforcement learning algorithms is crucial for making them data-efficient. This work pins down the data coverage conditions needed to achieve optimal rates in regularized contextual bandits, which can guide the design of future offline learning algorithms.
Key Takeaways
- The paper establishes tighter sample complexity bounds for offline $f$-divergence-regularized contextual bandits.
- A novel pessimism-based analysis achieves an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability.
- The results demonstrate the necessity of a multiplicative dependency on single-policy concentrability for optimal performance.
- For strongly convex $f$-divergences, sharp sample complexity can be achieved without pessimistic estimation.
- Numerical experiments corroborate the theoretical findings, and the analysis extends to contextual dueling bandits; the regularized objective in question is sketched after this list.
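For reference, the objective behind these bounds has the standard regularized form (generic notation assumed here, not quoted from the paper): given a reward function $r$, a reference policy $\pi_0$, and regularization strength $\beta > 0$,

$$J_\beta(\pi) = \mathbb{E}_{x}\Big[\, \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[r(x, a)\big] - \beta\, D_f\big(\pi(\cdot \mid x) \,\big\|\, \pi_0(\cdot \mid x)\big) \Big].$$

Single-policy concentrability asks only that the behavior policy $\mu$ that generated the offline data cover the (regularized-)optimal policy $\pi^*$, e.g. $C^* = \max_{x,a} \pi^*(a \mid x) / \mu(a \mid x) < \infty$, whereas all-policy concentrability demands such coverage uniformly over every candidate policy, a much stronger requirement.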
Computer Science > Machine Learning | arXiv:2502.06051 [cs]
Submitted 9 Feb 2025 (v1); last revised 26 Feb 2026 (this version, v3).
Title: Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
Authors: Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Tong Zhang, Quanquan Gu
Abstract: Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we achieve an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary.
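To make the reverse-KL case concrete, here is a minimal, self-contained sketch of pessimistic offline policy learning in a tabular contextual bandit. The Hoeffding-style confidence width and all names below are illustrative assumptions, not the paper's construction; what the sketch does reflect is the structure of the approach: lower-confidence reward estimates combined with the closed-form Gibbs maximizer of the KL-regularized objective.

```python
import numpy as np

def pessimistic_kl_policy(data, n_contexts, n_actions, pi_ref,
                          beta=1.0, delta=0.05):
    """Sketch of offline policy learning for a tabular contextual bandit
    with reverse-KL regularization and pessimistic reward estimation.

    The Hoeffding-style bonus is a generic illustrative choice, not the
    paper's exact construction. Rewards are assumed to lie in [0, 1].

    data: iterable of (context, action, reward) triples from a behavior policy
    pi_ref: reference policy, array of shape (n_contexts, n_actions)
    """
    counts = np.zeros((n_contexts, n_actions))
    sums = np.zeros((n_contexts, n_actions))
    for x, a, r in data:
        counts[x, a] += 1
        sums[x, a] += r

    # Empirical mean reward; unvisited (context, action) pairs default to 0.
    r_hat = np.divide(sums, counts, out=np.zeros_like(sums),
                      where=counts > 0)

    # Pessimism: subtract a Hoeffding-style confidence width so that poorly
    # covered actions look unattractive to the learned policy.
    width = np.sqrt(np.log(2.0 * n_contexts * n_actions / delta)
                    / (2.0 * np.maximum(counts, 1.0)))
    r_pess = np.clip(r_hat - width, 0.0, 1.0)

    # Per-context closed-form maximizer of E[r] - beta * KL(pi || pi_ref):
    # a Gibbs/softmax tilt of the reference policy by r_pess / beta.
    logits = np.log(np.maximum(pi_ref, 1e-12)) + r_pess / beta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_x, n_a = 5, 3
    pi_ref = np.full((n_x, n_a), 1.0 / n_a)   # uniform reference policy
    data = [(int(rng.integers(n_x)), int(rng.integers(n_a)),
             float(rng.random())) for _ in range(2000)]
    print(pessimistic_kl_policy(data, n_x, n_a, pi_ref, beta=0.5))
```

The Gibbs tilt in the last step is the exact per-context maximizer of the KL-regularized objective, so pessimism enters only through the reward estimate; for strongly convex $f$-divergences, the takeaways above indicate that the confidence subtraction can be dropped entirely while retaining the sharp rate.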