[2410.02605] Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning

arXiv - AI · 3 min read

Summary

This paper derives a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in reinforcement learning and builds on it to introduce a first-order algorithm for directly optimizing these risk-sensitive objectives.

Why It Matters

Understanding how Cumulative Prospect Theory can be applied to reinforcement learning is crucial for developing algorithms that better mimic human decision-making under risk. This research bridges behavioral economics and machine learning, potentially leading to more effective AI systems in uncertain environments.

Key Takeaways

  • Introduces a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in RL (the CPT objective is sketched just after this list).
  • Develops a first-order policy gradient algorithm using a Monte Carlo estimator based on order statistics (see the code sketch after the abstract below).
  • Establishes statistical guarantees for the proposed algorithm.
  • Proves asymptotic convergence to first-order stationary points of the (generally non-convex) CPT objective.
  • Compares the new approach with existing zeroth-order methods through simulations.
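
For context, the CPT objective in question typically takes the standard Tversky–Kahneman form; the following is a sketch under the usual conventions, and the paper's exact notation may differ. For a random return X, gain and loss utilities u⁺, u⁻ measured around a reference point, and probability weighting functions w⁺, w⁻:

C(X) = \int_0^{\infty} w^{+}\!\left(\Pr\left[u^{+}(X) > z\right]\right) dz \;-\; \int_0^{\infty} w^{-}\!\left(\Pr\left[u^{-}(X) > z\right]\right) dz

With identity utilities and no distortion (w^{\pm}(p) = p), C(X) collapses to the expected return relative to the reference point, which is why the paper's result generalizes the standard policy gradient theorem \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].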

Computer Science > Machine Learning

arXiv:2410.02605 (cs) [Submitted on 3 Oct 2024 (v1), last revised 17 Feb 2026 (this version, v4)]

Title: Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning
Authors: Olivier Lepel, Anas Barakat

Abstract: We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2410.02605 [cs.LG] (or arXiv:2410.02605v4 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2410.02605
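The abstract's "Monte Carlo gradient estimator based on order statistics" points to the quantile-style CPT estimators common in this literature. Below is a minimal Python sketch of such an estimator from sampled returns; the function name, signature, and weighting scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def cpt_estimate(returns, u_plus, u_minus, w_plus, w_minus, ref=0.0):
    """Quantile-based Monte Carlo estimate of the CPT value of sampled returns.

    Illustrative order-statistics construction; the paper's estimator may
    differ in details such as the exact weighting scheme.
    """
    x = np.sort(np.asarray(returns, dtype=float))  # order statistics X_(1) <= ... <= X_(n)
    n = x.size
    i = np.arange(1, n + 1)

    # Gains term: u+ applied to excesses over the reference point, weighted by
    # increments of the distorted upper-tail probability w+((n-i+1)/n) - w+((n-i)/n).
    gains = np.maximum(x - ref, 0.0)
    c_plus = np.sum(u_plus(gains) * (w_plus((n - i + 1) / n) - w_plus((n - i) / n)))

    # Losses term: the symmetric construction with w- on the lower tail.
    losses = np.maximum(ref - x, 0.0)
    c_minus = np.sum(u_minus(losses) * (w_minus(i / n) - w_minus((i - 1) / n)))

    return c_plus - c_minus

# Sanity check: identity utilities and no probability distortion reduce the
# CPT value to the mean return relative to the reference point.
rng = np.random.default_rng(0)
r = rng.normal(loc=1.0, scale=2.0, size=10_000)
identity = lambda z: z
no_distortion = lambda p: p
print(cpt_estimate(r, identity, identity, no_distortion, no_distortion))  # ~ r.mean()
```

Swapping in the classic Tversky–Kahneman choices, u(z) = z^0.88 and w(p) = p^0.61 / (p^0.61 + (1 - p)^0.61)^(1/0.61), overweights rare extreme outcomes. A first-order method like the paper's then presumably combines such an estimate with score-function (log-likelihood) gradients of the policy, whereas zeroth-order methods estimate gradients by perturbing the policy parameters directly.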

Related Articles

LLMs

What's your "When Language Model AI can do X, I'll be impressed"?

I have two at the top of my mind: When it can read musical notes. I will be mildly impressed when I can paste in a picture of musical not...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

Meta’s New AI Asked for My Raw Health Data—and Gave Me Terrible Advice | WIRED

Meta’s Muse Spark model offers to analyze users’ health data, including lab results. Beyond the obvious privacy risks, it’s not a capable...

Wired - AI · 9 min ·
Machine Learning

What image/video training data is hardest to find right now? [R]

I'm building a crowdsourced photo collection platform (contributors take photos with smartphones, we auto-label with YOLO/CLIP + enrich w...

Reddit - Machine Learning · 1 min ·
Machine Learning

I implemented DPO from the paper and the reward margin hit 599 here's what that actually means [R]

DPO (Rafailov et al., NeurIPS 2023) is supposed to be the clean alternative to PPO. No reward model in the training loop, no value functi...

Reddit - Machine Learning · 1 min ·