[2603.01563] LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
Computer Science > Machine Learning
arXiv:2603.01563 (cs)
[Submitted on 2 Mar 2026]

Title: LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains where correctness can be checked, such as mathematical reasoning and code generation. However, directly applying this paradigm to diffusion large language models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector-field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the...
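The abstract describes optimizing denoising logits directly via contrastive updates, avoiding any sequence-likelihood estimate. The paper's actual objective is not shown on this page, so the following is only a minimal sketch of what a group-relative contrastive update on per-position logits could look like: completions sampled from the denoiser are scored by a verifiable reward, and tokens from above-average completions are pushed up while below-average ones are pushed down. The function name, shapes, and learning rate are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrastive_logit_update(logits, group_tokens, rewards, lr=0.1):
    """One hypothetical contrastive update on denoising logits.

    logits:       (L, V) logits the denoiser assigns to each masked position
    group_tokens: (K, L) token ids from K sampled completions
    rewards:      (K,)   verifiable rewards, one per completion

    The update is computed directly from logits and sampled tokens;
    no approximation of the sequence likelihood is involved.
    """
    adv = rewards - rewards.mean()            # group-relative advantages
    p = softmax(logits)                       # (L, V) per-position dists
    grad = np.zeros_like(logits)
    for k in range(group_tokens.shape[0]):
        onehot = np.eye(logits.shape[1])[group_tokens[k]]   # (L, V)
        grad += adv[k] * (onehot - p)         # push toward winning tokens
    return logits + lr * grad                 # gradient-ascent step
```

With two completions of rewards 1 and 0, the update raises the logits of the winning completion's tokens relative to the losing one's at each position; this mirrors the contrastive, likelihood-free flavor the abstract claims, but is not the paper's method.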