[2510.23868] GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Computer Science > Machine Learning
arXiv:2510.23868 (cs)
[Submitted on 27 Oct 2025 (v1), last revised 8 Apr 2026 (this version, v4)]

Title: GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Authors: Zhichao Wang

Abstract: This paper proposes \textit{Group-relative Implicit Fine-Tuning (GIFT)}, a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning. GIFT combines three key elements: (1) group-based sampling and normalization from GRPO, (2) the implicit reward formulation of DPO, and (3) the training principle underlying UNA. The central idea is to transform reward maximization into a \textit{group-wise reward matching problem}. By jointly normalizing implicit and explicit rewards within each sampled group, GIFT eliminates the intractable normalization constant associated with implicit rewards and, through the same normalization, reduces sensitivity to the KL-regularization coefficient. This yields a simple mean squared error (MSE) objective between normalized implicit and explicit reward functions, providing a stable and analytically tractable training signal. Unlike offline approaches such as DPO and UNA, GIFT retains exploration through on-policy response sampling. Compared to GRPO, it replaces high-variance re...
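The abstract gives enough detail to sketch the objective in broad strokes: the DPO implicit reward \(\beta \log(\pi_\theta/\pi_{\text{ref}})\) and the explicit reward are each standardized within a group of sampled responses, and the loss is the MSE between the two normalized quantities. The following is a minimal Python sketch based only on that description; the function name, `beta`, and the exact normalization are illustrative assumptions, not the paper's implementation.

```python
import torch

def gift_loss(logp_policy: torch.Tensor,
              logp_ref: torch.Tensor,
              rewards: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """Illustrative sketch of the GIFT objective as described in the abstract.

    logp_policy, logp_ref: (G,) summed log-probabilities of G responses
        sampled on-policy, under the current policy and a frozen reference.
    rewards: (G,) explicit scalar rewards for the same G responses.
    """
    # DPO-style implicit reward. The intractable per-prompt constant
    # beta * log Z(x) is omitted: it is shared across the group, so it
    # cancels under the mean-subtraction below (the abstract's claim).
    implicit = beta * (logp_policy - logp_ref)

    # Group-wise standardization, as in GRPO's advantage normalization.
    def normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        return (x - x.mean()) / (x.std() + eps)

    # MSE between normalized implicit and explicit rewards.
    return torch.mean((normalize(implicit) - normalize(rewards)) ** 2)
```

Note that in this simplified form the scale `beta` drops out entirely after standardization; the abstract claims only *reduced* sensitivity to the KL coefficient, so the paper's actual objective presumably differs in detail.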