[2505.19731] Proximal Point Nash Learning from Human Feedback
Statistics > Machine Learning

arXiv:2505.19731 (stat)

[Submitted on 26 May 2025 (v1), last revised 22 Mar 2026 (this version, v2)]

Title: Proximal Point Nash Learning from Human Feedback

Authors: Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard

Abstract: Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley--Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. While many works study the Nash learning problem directly in the policy space, we instead consider it under a more realistic policy parametrization setting. We first analyze a simple self-play policy gradient method, which is equivalent to Online IPO. We establish high-probability last-iterate convergence guarantees for this method, but our analysis also reveals a possible stability limitation of the underlying dynamics. Motivated by this, we embed the self-play updates into a proximal point framework, yielding a stabilized algorithm. For this combined method, we prove high-prob...
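For readers unfamiliar with the two preference models the abstract contrasts, the following are the standard formulations from the RLHF/NLHF literature; they are not quoted from this paper's text. The Bradley--Terry model scores each response with a scalar reward $r$ and sets

```latex
\mathbb{P}(y \succ y' \mid x)
  = \sigma\big(r(x, y) - r(x, y')\big)
  = \frac{e^{r(x, y)}}{e^{r(x, y)} + e^{r(x, y')}},
```

which forces transitive preferences, since responses are totally ordered by $r(x, \cdot)$; real preference data can instead be intransitive ($y_1 \succ y_2$, $y_2 \succ y_3$, yet $y_3 \succ y_1$). NLHF drops this assumption by working with the preference probability directly, seeking a Nash equilibrium of the symmetric two-player game

```latex
\pi^\star \in \arg\max_{\pi} \min_{\pi'} \mathcal{P}(\pi \succ \pi'),
\qquad
\mathcal{P}(\pi \succ \pi')
  = \mathbb{E}_{y \sim \pi,\, y' \sim \pi'}\big[\mathbb{P}(y \succ y')\big].
```

To see why a proximal point step can stabilize self-play where a plain gradient step does not, here is a minimal tabular sketch. Everything in it is illustrative: the 3-action rock-paper-scissors preference matrix, the softmax policy, the step sizes, and the fixed-point inner loop are assumptions for the demo, not the paper's algorithm (which targets parametrized policies trained with Online IPO losses).

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def selfplay_grad(theta, A):
    """Self-play policy gradient for the symmetric game u(pi, mu) = pi^T A mu,
    with the opponent frozen at the current policy (mu = stopgrad(pi_theta)).
    Chain rule through softmax: grad = pi * (A mu) - (pi^T A mu) * pi."""
    pi = softmax(theta)
    v = A @ pi
    return pi * v - (pi @ v) * pi

# Intransitive preference matrix, P[i, j] = P(action i is preferred to j):
# a rock-paper-scissors cycle that no Bradley--Terry reward can represent.
P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
A = P - 0.5            # centered payoff; skew-symmetric since P + P^T = 1

theta = np.array([1.0, 0.0, -0.5])   # arbitrary non-uniform start
eta, inner_steps = 1.0, 10

for t in range(3000):
    # Proximal point (implicit) update: solve z = theta + eta * selfplay_grad(z)
    # approximately by fixed-point iteration, then accept z as the new theta.
    # The explicit update theta += eta * selfplay_grad(theta) cycles on this
    # game; the implicit one damps the rotation and spirals toward the Nash.
    z = theta.copy()
    for _ in range(inner_steps):
        z = theta + eta * selfplay_grad(z, A)
    theta = z

print(softmax(theta))  # approaches the uniform Nash equilibrium [1/3, 1/3, 1/3]
```

On this toy game the self-play gradient field is purely rotational around the uniform policy, so explicit gradient steps spiral outward (the last iterate fails to converge), while the implicit proximal step contracts toward the equilibrium; that is the stabilization effect the abstract alludes to, in its simplest form.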