[2502.02088] Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
Summary
The paper presents Dual-IPO, a novel framework for optimizing text-to-video generation by iteratively improving both the reward and video generation models to enhance output quality and user preference alignment.
Why It Matters
As video generation technology advances, ensuring that outputs meet user expectations is crucial. Dual-IPO addresses this by alternately refining the reward model and the generation model, improving how AI-generated videos align with human preferences while raising overall synthesis quality.
Key Takeaways
- Dual-IPO optimizes video generation through a dual-iterative process.
- The framework enhances synthesis quality by aligning outputs with user preferences.
- It utilizes CoT-guided reasoning and voting-based self-consistency for robust reward signals.
- Experiments show significant improvements in video quality, even with smaller models.
- The approach eliminates the need for extensive manual preference annotations.
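The takeaways above mention voting-based self-consistency and preference certainty estimation as the source of robust reward signals. A minimal sketch of that idea: collect several independent CoT-style judgments for one video pair, take the majority vote, and keep the pair only if the vote margin (a simple certainty proxy) is high enough. The function name, the `'A'`/`'B'` vote encoding, and the 0.7 threshold are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def aggregate_preference(votes, certainty_threshold=0.7):
    """Aggregate multiple CoT judgments ('A' or 'B') for one video pair.

    Majority vote picks the preferred video; certainty is estimated as the
    winning vote fraction. Low-certainty (ambiguous) pairs are dropped by
    returning None. Names and threshold are illustrative, not the paper's.
    """
    counts = Counter(votes)
    winner, n_win = counts.most_common(1)[0]
    certainty = n_win / len(votes)
    if certainty < certainty_threshold:
        return None  # ambiguous pair: exclude from preference data
    return winner, certainty

# Example: five independent judgments of the same video pair
print(aggregate_preference(["A", "A", "B", "A", "A"]))  # ('A', 0.8)
print(aggregate_preference(["A", "B", "A", "B", "B"]))  # 3/5 < 0.7 -> None
```

Filtering on certainty in this way is one plausible reading of how "preference certainty estimation" avoids training on noisy labels without manual annotation.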
Computer Science > Computer Vision and Pattern Recognition
arXiv:2502.02088 (cs) [Submitted on 4 Feb 2025 (v1), last revised 26 Feb 2026 (this version, v5)]
Authors: Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan, Hao Li
Abstract: Recent advances in video generation, driven by scalable diffusion transformers, have enabled the production of strikingly realistic videos. However, these models often fail to produce outputs aligned with users' authentic demands and preferences. In this work, we introduce Dual-Iterative Preference Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation. Given these signals, we optimize video foundation models guided by the reward model's feedback, thereby improving synthesis quality in subject consistency, motion smoothness, aesthetic quality, and related dimensions. The reward model and the video generation model complement each other and are progressively improved in the multi-...
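The abstract describes an alternating loop: the reward model ranks fresh generations, the ranked pairs preference-optimize the generator, and the improved generations in turn refine the reward model. The skeleton below sketches that control flow under stated assumptions; the stub classes and every method name (`sample`, `rank_to_pairs`, `preference_update`, `refine`) are illustrative stand-ins, not the paper's API.

```python
import random

class StubGenerator:
    """Toy stand-in for a video diffusion model (illustrative only)."""
    def __init__(self):
        self.quality = 0.5
    def sample(self, prompt):
        # A "video" here is just (prompt, score) for demonstration.
        return (prompt, self.quality + random.random() * 0.1)
    def preference_update(self, pairs):
        self.quality += 0.05 * len(pairs)  # pretend DPO-style update

class StubRewardModel:
    """Toy stand-in: ranks candidates by score to form (winner, loser) pairs."""
    def rank_to_pairs(self, prompt, vids):
        best = max(vids, key=lambda v: v[1])
        worst = min(vids, key=lambda v: v[1])
        return (best, worst)
    def refine(self, pairs):
        pass  # placeholder for reward-model fine-tuning on new pairs

def dual_ipo_loop(generator, reward_model, prompts, num_rounds=3):
    """Schematic of the dual-iterative loop described in the abstract."""
    for _ in range(num_rounds):
        # 1) Sample several candidate videos per prompt
        candidates = {p: [generator.sample(p) for _ in range(4)] for p in prompts}
        # 2) Reward model ranks candidates into preference pairs
        pairs = [reward_model.rank_to_pairs(p, v) for p, v in candidates.items()]
        # 3) Preference-optimize the generator on the ranked pairs
        generator.preference_update(pairs)
        # 4) Refine the reward model using the new generations
        reward_model.refine(pairs)
    return generator, reward_model
```

The point of the sketch is the ordering: each round, reward labeling precedes the generator update, and the refined generator's outputs feed the next round's reward refinement, so the two models improve in tandem.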