[2411.11727] Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
Summary
This paper presents Stepwise Diffusion Policy Optimization (SDPO), a reinforcement learning framework that aligns few-step diffusion models for high-resolution image synthesis with downstream objectives, addressing the limitations of existing RL methods in low-step regimes.
Why It Matters
Few-step diffusion models are increasingly important in generative AI because they cut sampling cost, but aligning them with specific objectives is hard. By densifying reward feedback along the sampling trajectory, SDPO aims to improve both the efficiency of fine-tuning and the quality of synthesized images, which matters for practical computer vision applications.
Key Takeaways
- SDPO introduces a dual-state trajectory sampling mechanism for better reward feedback.
- The framework minimizes costly dense reward queries through a latent similarity-based strategy.
- Stepwise advantage estimates improve long-term credit assignment and gradient stability.
- SDPO consistently outperforms existing methods across diverse tasks.
- The code for SDPO is made publicly available, promoting further research and application.
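The latent similarity-based strategy in the second takeaway can be illustrated with a small sketch. The idea, as described in the abstract, is to avoid calling the (expensive) reward model at every denoising step by reusing a cached reward whenever the predicted clean latent has barely changed. The function names, the cosine-similarity measure, and the threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity between two flattened latent vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dense_rewards(latents, reward_fn, sim_threshold=0.95):
    """Query the reward model only when the predicted clean latent has
    drifted enough from the last queried one; otherwise reuse the cached
    reward as a cheap dense-reward prediction (hypothetical sketch)."""
    rewards = []
    last_latent, last_reward = None, None
    for z in latents:
        if last_latent is None or cosine_similarity(z, last_latent) < sim_threshold:
            last_reward = reward_fn(z)  # costly reward-model call
            last_latent = z
        rewards.append(last_reward)
    return rewards
```

Under this scheme the number of reward-model calls scales with how much the trajectory actually changes between steps, rather than with the number of steps.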
Paper Details
Computer Science > Machine Learning, arXiv:2411.11727 (cs). Submitted on 18 Nov 2024 (v1); last revised 26 Feb 2026 (this version, v2).
Title: Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
Authors: Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Dongjing Shan, Bo Du, Dacheng Tao
Abstract: Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and ...
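The abstract's "stepwise advantage estimates" built from dense rewards can be sketched under one plausible reading: treat the per-step reward difference as the advantage signal for that step, then normalize across steps for gradient stability. The difference form and the normalization are assumptions for illustration; the paper's exact objective may differ:

```python
import numpy as np

def stepwise_advantages(dense_rewards):
    """Hypothetical stepwise advantage estimate: the per-step dense-reward
    difference r_t - r_{t-1} (first step gets 0), normalized across steps
    for variance reduction. A sketch of the idea, not the paper's formula."""
    r = np.asarray(dense_rewards, dtype=float)
    adv = np.diff(r, prepend=r[0])  # reward gained at each step
    std = adv.std()
    if std > 0:
        adv = (adv - adv.mean()) / std  # stabilize gradient scale
    return adv
```

Each step of the few-step sampler then receives its own learning signal, which is what enables the "more frequent and granular policy updates" the abstract describes.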