[2602.17978] Learning Optimal and Sample-Efficient Decision Policies with Guarantees
Summary
This paper presents a novel approach to learning optimal and sample-efficient decision policies in reinforcement learning, addressing challenges posed by hidden confounders and improving sample efficiency in high-stakes applications.
Why It Matters
The research tackles significant barriers in reinforcement learning, particularly in high-stakes environments where traditional methods are impractical. By focusing on offline learning and the influence of hidden confounders, this work has implications for various fields, including robotics and healthcare, where decision-making must be both efficient and reliable.
Key Takeaways
- Introduces a sample-efficient algorithm for learning from offline datasets with hidden confounders.
- Adapts causal inference techniques to improve decision-making in reinforcement learning.
- Demonstrates improved sample efficiency for learning high-level objectives using linear temporal logic.
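The last takeaway refers to high-level objectives expressed in linear temporal logic (LTL). As a minimal sketch of what such objectives look like, the snippet below evaluates a few standard LTL-style operators over a finite trace; the operator names and the navigation-task propositions (`safe`, `goal`) are illustrative assumptions, not the paper's formalism.

```python
# Minimal sketch: checking LTL-style objectives on a finite trace.
# Each step of the trace is the set of atomic propositions true at that time.

def eventually(trace, prop):
    """F prop: prop holds at some step of the trace."""
    return any(prop in step for step in trace)

def always(trace, prop):
    """G prop: prop holds at every step of the trace."""
    return all(prop in step for step in trace)

def until(trace, p, q):
    """p U q: q eventually holds, and p holds at every step before that."""
    for step in trace:
        if q in step:
            return True
        if p not in step:
            return False
    return False

# A trace from a hypothetical navigation task.
trace = [{"safe"}, {"safe"}, {"safe", "goal"}]

print(eventually(trace, "goal"))     # True: goal reached at the last step
print(always(trace, "safe"))         # True: safe at every step
print(until(trace, "safe", "goal"))  # True: stays safe until the goal
```

In RL with LTL objectives, a policy is typically rewarded for producing traces that satisfy such a formula, rather than for maximising a hand-crafted scalar reward.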
Computer Science > Machine Learning
arXiv:2602.17978 (cs) [Submitted on 20 Feb 2026]
Title: Learning Optimal and Sample-Efficient Decision Policies with Guarantees
Authors: Daqian Shao
Abstract
The paradigm of decision-making has been revolutionised by reinforcement learning (RL) and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restrictions (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which o...
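To see why a hidden confounder makes naive offline estimation unreliable, and how an instrumental variable fixes it, here is a simple simulation. This is an illustrative sketch of classic two-stage least squares (the Wald estimator), not the paper's CMR algorithm; the linear data-generating process and all coefficients are assumptions chosen for the example.

```python
# Illustrative sketch: a hidden confounder U biases the naive regression of
# outcome Y on action A, while an instrument Z (which affects A but not Y
# directly) recovers the true causal effect.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 2.0

U = rng.normal(size=n)             # hidden confounder (unobserved in practice)
Z = rng.normal(size=n)             # instrument: influences A, independent of U
A = Z + U + rng.normal(size=n)     # action depends on instrument and confounder
Y = true_effect * A + 3.0 * U + rng.normal(size=n)  # confounded outcome

# Naive OLS slope of Y on A absorbs the spurious correlation through U.
ols = np.cov(A, Y)[0, 1] / np.var(A)

# IV (Wald) estimator: Cov(Z, Y) / Cov(Z, A) isolates the causal effect,
# since Z is correlated with A but not with the confounder U.
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, A)[0, 1]

print(f"OLS estimate: {ols:.2f}")  # biased away from 2.0 (toward 3.0 here)
print(f"IV  estimate: {iv:.2f}")   # close to the true effect 2.0
```

In this setup one can check analytically that OLS converges to 3.0 rather than 2.0, because Cov(A, Y) picks up the 3U term through Cov(A, U) = 1, while the instrument-based ratio cancels it.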