[2405.08253] Thompson Sampling for Infinite-Horizon Discounted Decision Processes
Statistics > Machine Learning
arXiv:2405.08253 (stat)
[Submitted on 14 May 2024 (v1), last revised 8 Apr 2026 (this version, v3)]

Title: Thompson Sampling for Infinite-Horizon Discounted Decision Processes
Authors: Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

Abstract: This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model discounted infinite-horizon MDPs with Borel state and action spaces whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling, we introduce a general canonical probability space for this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) corresponds to the traditional notion of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret ...
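One natural way such a three-way split could look, sketched here under assumed notation (the paper's precise definitions are not reproduced in this abstract): cut the infinite-horizon discounted regret at a horizon T, where s_t* and s_t denote states on the optimal and learned trajectories, V* is the optimal value function, and V^π is the value of the learned policy's continuation. The three bracketed terms then telescope back to the total regret V*(s_0) - V^π(s_0):

$$
\underbrace{\mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^t \bigl(r(s_t^*, a_t^*) - r(s_t, a_t)\bigr)\right]}_{\text{(i) expected finite-time regret}}
\;+\;
\underbrace{\gamma^T\, \mathbb{E}\!\left[V^*(s_T^*) - V^*(s_T)\right]}_{\text{(ii) expected state regret}}
\;+\;
\underbrace{\gamma^T\, \mathbb{E}\!\left[V^*(s_T) - V^{\pi}(s_T)\right]}_{\text{(iii) expected residual regret}}
$$

Term (ii) is zero when the learned trajectory reaches a state as favorable as the optimal one, and term (iii) vanishes as the learned continuation policy approaches optimality from the state actually reached.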
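To make the sampling-based algorithm concrete, here is a minimal, hedged sketch of Thompson (posterior) sampling for a small discounted MDP with unknown Bernoulli reward parameters and a known transition kernel. Everything here, including the tabular setup, the Beta posterior, and the per-step value-iteration solve, is an illustrative assumption, not the paper's construction (which works with Borel state and action spaces):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                       # discount factor
n_states, n_actions = 3, 2

# True (unknown to the learner) reward success probabilities per (state, action).
true_p = rng.uniform(0.2, 0.8, size=(n_states, n_actions))
# Known transition kernel P[s, a, s'] (rows are probability vectors).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def greedy_policy(p):
    """Value iteration for the MDP induced by sampled reward parameters p."""
    V = np.zeros(n_states)
    for _ in range(500):
        Q = p + gamma * P @ V     # Q[s, a]; P @ V contracts the s' axis
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Beta(alpha, beta) posterior counts for each (state, action) reward.
alpha = np.ones((n_states, n_actions))
beta = np.ones((n_states, n_actions))

s = 0
for t in range(2000):
    # Thompson step: draw one parameter from the posterior, act greedily
    # in the MDP that draw induces, then update the posterior with the
    # observed reward.
    p_sample = rng.beta(alpha, beta)
    policy = greedy_policy(p_sample)
    a = policy[s]
    r = int(rng.random() < true_p[s, a])   # Bernoulli reward
    alpha[s, a] += r
    beta[s, a] += 1 - r
    s = rng.choice(n_states, p=P[s, a])

posterior_mean = alpha / (alpha + beta)
```

Re-solving the sampled MDP at every step is the simplest variant to state; episodic resampling schedules are common in practice, since a single posterior draw can be reused for many steps.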