[2410.02605] Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning
Summary
This paper derives a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in reinforcement learning and builds on it to introduce a first-order algorithm with statistical guarantees, enabling optimization of risk-sensitive, behaviorally motivated objectives.
Why It Matters
Understanding how Cumulative Prospect Theory can be applied to reinforcement learning is crucial for developing algorithms that better mimic human decision-making under risk. This research bridges behavioral economics and machine learning, potentially leading to more effective AI systems in uncertain environments.
Key Takeaways
- Introduces a policy gradient theorem for Cumulative Prospect Theory in RL.
- Develops a first-order policy gradient algorithm using Monte Carlo methods.
- Establishes statistical guarantees for the proposed algorithm.
- Proves asymptotic convergence to first-order stationary points of the (generally non-convex) CPT objective.
- Compares the new approach with existing zeroth-order methods through simulations.
Computer Science > Machine Learning
arXiv:2410.02605 (cs.LG)
[Submitted on 3 Oct 2024 (v1), last revised 17 Feb 2026 (this version, v4)]

Title: Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning
Authors: Olivier Lepel, Anas Barakat

Abstract: We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2410.02605 [cs.LG], https://doi.org/10.48550/arXiv.2410.02605
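To make the abstract's ingredients concrete, the sketch below estimates the CPT value of a return distribution from Monte Carlo samples using order statistics: sort the sampled returns, then weight each sorted outcome by differences of a distorted cumulative probability, with an asymmetric utility around a reference point. The specific utility and weighting functions and the parameter defaults (`alpha`, `lam`, `gamma` from Tversky and Kahneman's 1992 calibration) are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def cpt_value(samples, ref=0.0, alpha=0.88, lam=2.25, gamma=0.61):
    """Order-statistics Monte Carlo estimate of a CPT value.

    Illustrative sketch: power utility around a reference point `ref`,
    loss aversion factor `lam`, and an inverse-S probability weighting
    function with curvature `gamma` (Tversky-Kahneman-style defaults).
    """
    x = np.sort(np.asarray(samples, dtype=float))  # ascending order
    n = len(x)

    def u_plus(z):   # utility applied to gains above the reference point
        return np.maximum(z - ref, 0.0) ** alpha

    def u_minus(z):  # utility applied to losses below the reference point
        return lam * np.maximum(ref - z, 0.0) ** alpha

    def w(p):        # probability distortion (inverse-S weighting)
        return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)

    i = np.arange(1, n + 1)
    # Gains: the i-th smallest outcome carries weight w((n-i+1)/n) - w((n-i)/n),
    # so the largest gains are weighted through the distorted tail probability.
    gain_w = w((n - i + 1) / n) - w((n - i) / n)
    # Losses: symmetric construction from the other end of the sorted sample.
    loss_w = w(i / n) - w((i - 1) / n)
    return float(np.sum(u_plus(x) * gain_w) - np.sum(u_minus(x) * loss_w))
```

With `alpha=1`, `lam=1`, `gamma=1` the distortion and asymmetry vanish and the estimate reduces to the sample mean of `x - ref`, recovering the standard risk-neutral objective as a special case; with the default parameters, loss aversion makes a symmetric gamble such as `[-1, 1]` net negative.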