[2505.20065] SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Computer Science > Machine Learning
arXiv:2505.20065 (cs)
[Submitted on 26 May 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Authors: Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, Moontae Lee

Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the original safety alignment objective and show that, under mild assumptions, it admits a closed-form optimal policy. We further derive a provably equivalent and tractable objective, enabling direct optimization. Building on this insight, we propose SafeDPO, a lightweight method that preserves the optimal solution of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to existing preference-based training methods. SafeDPO eliminates the need for rewa...
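Since the abstract positions SafeDPO as a minimal modification of existing preference-based training, it may help to recall the standard DPO loss it builds on. The sketch below computes the per-pair DPO loss from policy and reference log-probabilities; it is the generic baseline, not SafeDPO itself, whose exact safety-augmented objective and extra hyperparameter are given in the paper rather than in this abstract. The function name and argument names are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically this is log(1 + exp(-margin)), the softplus of -margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy favors the chosen response more than the reference does,
# the margin is positive and the loss falls below log(2) (its value at margin 0).
assert dpo_loss(-1.0, -3.0, -2.0, -2.0) < math.log(2.0)
```

SafeDPO's contribution, per the abstract, is that a safety-constrained RLHF objective reduces to a tractable loss of this preference-based form with one additional hyperparameter, so existing DPO training code needs only minimal changes.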