[2510.08431] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Summary
This paper presents a novel approach to large-scale diffusion distillation using a score-regularized continuous-time consistency model, addressing challenges in generating high-quality images and videos.
Why It Matters
As machine learning applications increasingly require high-quality image and video generation, this research offers a scalable solution that enhances visual fidelity while maintaining diversity, which is crucial for practical implementations in various AI fields.
Key Takeaways
- Introduces score-regularized continuous-time consistency model (rCM) for improved image and video generation.
- Demonstrates significant enhancements in visual quality and diversity over existing methods.
- Achieves high-fidelity sample generation in fewer steps, accelerating diffusion sampling by up to 50 times.
- Validates effectiveness on large models with over 10 billion parameters.
- Provides a theoretically grounded framework for advancing large-scale diffusion distillation.
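For context on the "fewer steps" takeaway: consistency models generate by mapping noise directly to a clean-sample estimate, optionally refining over a small number of re-noising steps. The sketch below shows generic multistep consistency sampling in the style of Song et al.; `consistency_fn` is a dummy stand-in for a trained model, not the paper's rCM, and the noise schedule is illustrative only.

```python
import numpy as np

def consistency_fn(x, t):
    # Dummy stand-in for a trained consistency model f_theta(x, t).
    # A real model maps a noisy input at time t directly to an
    # estimate of the clean sample; here we just shrink toward zero.
    return x / (1.0 + t)

def multistep_consistency_sample(shape, times, rng):
    """Generic multistep consistency sampling: start from pure noise,
    map to a clean estimate in one step, then repeatedly re-noise to a
    smaller time and map again to refine fine details."""
    x = rng.standard_normal(shape) * times[0]   # x_T ~ N(0, T^2 I)
    sample = consistency_fn(x, times[0])        # one-step estimate
    for t in times[1:]:
        noisy = sample + rng.standard_normal(shape) * t  # re-noise to time t
        sample = consistency_fn(noisy, t)                # refine
    return sample

rng = np.random.default_rng(0)
img = multistep_consistency_sample((8, 8), times=[80.0, 20.0, 5.0], rng=rng)
print(img.shape)  # (8, 8)
```

With `times=[80.0]` this degenerates to pure one-step generation; adding intermediate times trades speed for the error-accumulation issues the paper discusses.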
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.08431 (cs) [Submitted on 9 Oct 2025 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Abstract: Although continuous-time consistency models (e.g., sCM, MeanFlow) are theoretically principled and empirically powerful for fast academic-scale diffusion, their applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of evaluation benchmarks like FID. This work represents the first effort to scale up continuous-time consistency to general application-level image and video diffusion models, and to make JVP-based distillation effective at large scale. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-diver...
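The abstract's central engineering hurdle is computing Jacobian-vector products through attention at scale. As a minimal illustration of the JVP primitive itself (not the paper's FlashAttention-2 kernel), the sketch below approximates the JVP of a toy scaled dot-product attention with central finite differences; the names `attention` and `jvp_fd` are illustrative, and frameworks like JAX provide exact forward-mode JVPs instead.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # toy scaled dot-product attention: softmax(q k^T / sqrt(d)) v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def jvp_fd(f, x, tangent, eps=1e-5):
    # central finite-difference JVP:
    # (f(x + eps*t) - f(x - eps*t)) / (2*eps) ~= J_f(x) @ t
    return (f(x + eps * tangent) - f(x - eps * tangent)) / (2 * eps)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
tq = rng.standard_normal((4, 8))  # tangent direction for q
out = jvp_fd(lambda q_: attention(q_, k, v), q, tq)
print(out.shape)  # (4, 8)
```

A JVP is linear in the tangent, so doubling `tq` should (up to finite-difference error) double the output; the paper's contribution is making this primitive exact and parallelism-compatible inside a fused attention kernel.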