[2503.17338] Capturing Individual Human Preferences with Reward Features
Summary
The paper proposes learning reward models in reinforcement learning from human feedback that can be specialised to individual users, rather than relying on a single reward function shared across all raters, and argues that such adaptive reward models are needed to account for user diversity.
Why It Matters
Capturing individual preferences matters in AI applications, especially in the training of large language models, where raters can genuinely disagree. This research addresses a limitation of the standard approach, which fits one reward function that does not distinguish between people, and points toward more personalized and effective AI systems.
Key Takeaways
- Traditional reward functions in reinforcement learning often ignore individual user preferences.
- The proposed adaptive reward model can better capture diverse human feedback.
- Empirical risk minimization is used to derive a PAC bound for approximation error.
- The model's effectiveness increases with the number of raters and preference heterogeneity.
- Experiments demonstrate the advantages of the adaptive model over non-adaptive counterparts.
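The paper's concrete architecture is not reproduced in this summary; as a minimal illustrative sketch, one natural way to realise "reward features" is a shared feature map with per-user linear weights, so that each user's reward is a different linear function of the same features. The names `reward_features`, `w_alice`, and `w_bob` below are hypothetical, not taken from the paper.

```python
import numpy as np

def reward_features(x):
    # Hypothetical shared feature map phi(x). In practice this would be
    # a learned network mapping a (prompt, response) pair to a vector;
    # here we simply treat the input as its own feature vector.
    return np.asarray(x, dtype=float)

def user_reward(x, w_user):
    # Personalised reward: a user-specific linear readout of the
    # shared features, r_u(x) = w_u . phi(x).
    return float(reward_features(x) @ w_user)

# Two users who weight the same features differently.
w_alice = np.array([1.0, 0.0])   # cares only about feature 0
w_bob   = np.array([0.0, 1.0])   # cares only about feature 1

x = [0.9, 0.2]
print(user_reward(x, w_alice))   # 0.9
print(user_reward(x, w_bob))     # 0.2
```

Under this decomposition, the shared features are learned from all raters' feedback, while adapting to a new user only requires estimating a small weight vector.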
arXiv:2503.17338 (cs.AI)
Submitted on 21 Mar 2025 (v1); last revised 19 Feb 2026 (v2)
Title: Capturing Individual Human Preferences with Reward Features
Authors: André Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle
Abstract: Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observat...
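The abstract mentions collecting pairwise preference data and specialising the reward model to a user. A standard way to fit per-user weights from pairwise comparisons is a Bradley-Terry model, where the probability that the user prefers response a over b is a logistic function of the reward difference. The sketch below fits such weights by gradient descent over precomputed feature vectors; the function name `fit_user_weights` and the toy data are illustrative, not the paper's method.

```python
import numpy as np

def fit_user_weights(pairs, prefs, lr=0.5, steps=200):
    """Fit a user's weight vector w from pairwise comparisons.

    pairs: list of (phi_a, phi_b), feature vectors of two responses.
    prefs: 1.0 if the user preferred a over b, else 0.0.
    Bradley-Terry model: P(a preferred over b) = sigmoid(w . (phi_a - phi_b)).
    """
    d = len(pairs[0][0])
    w = np.zeros(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for (phi_a, phi_b), y in zip(pairs, prefs):
            diff = np.asarray(phi_a) - np.asarray(phi_b)
            p = 1.0 / (1.0 + np.exp(-w @ diff))   # predicted preference prob
            grad += (p - y) * diff                # logistic-loss gradient
        w -= lr * grad / len(pairs)
    return w

# A user who consistently prefers responses with higher feature 0.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.1, 0.9])]
prefs = [1.0, 1.0]
w = fit_user_weights(pairs, prefs)
# The fitted weights should favour feature 0 over feature 1.
```

Because only the low-dimensional vector w is user-specific, a few comparisons from a new rater can suffice once good shared features exist, which is consistent with the paper's argument that adaptivity pays off when raters disagree.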