[2502.11026] RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
Computer Science > Machine Learning
arXiv:2502.11026 (cs)
[Submitted on 16 Feb 2025 (v1), last revised 21 Mar 2026 (this version, v3)]

Title: RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
Authors: Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continually challenged by its implementation complexity and computational cost, especially for online sampling-based methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Even with recent simplifications such as Direct Preference Optimization (DPO), which designs an offline implicit reward-learning objective over pre-collected preference datasets, over-fitting and training instability continue to keep alignment short of the expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called Variational Alignment with Re-weighting (VAR). Specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment ...
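The abstract is truncated, but the idea it describes, fitting the policy toward the closed-form optimal RLHF solution, is commonly realized as a reward-weighted SFT objective: responses are weighted by (normalized) exp(r/β) before the usual negative log-likelihood. The sketch below illustrates that shape only; the function names, the softmax normalization over sampled responses, and the choice of β are assumptions for illustration, not the paper's exact VAR objective.

```python
import math

def reward_weights(rewards, beta=1.0):
    # Normalized exp(r / beta) weights over a group of sampled responses
    # (a numerically stable softmax); higher reward -> larger weight.
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def reward_weighted_sft_loss(log_probs, rewards, beta=1.0):
    # Reward-weighted negative log-likelihood: an SFT-style loss where each
    # response's log-probability under the policy is scaled by its weight.
    w = reward_weights(rewards, beta)
    return -sum(wi * lp for wi, lp in zip(w, log_probs))
```

With this shape, a lower β concentrates the weights on the highest-reward response (approaching best-of-n imitation), while a higher β spreads them out toward uniform SFT.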