[2511.17844] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Summary
This paper presents a data-efficient approach for fine-tuning text-to-video diffusion models, demonstrating that sparse, low-quality synthetic data can outperform high-fidelity photorealistic datasets when adding new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture).
Why It Matters
The findings challenge traditional assumptions about data quality in machine learning, particularly in generative models. By showing that less can be more, this research could lead to more accessible and efficient training methods, reducing the need for extensive datasets and enabling broader applications in AI-driven video generation.
Key Takeaways
- Data-efficient fine-tuning can enhance text-to-video generation.
- Low-quality synthetic data may yield better results than high-fidelity datasets.
- The study provides an intuitive and quantitative framework explaining why sparse synthetic data is effective.
- This approach can reduce the barriers to entry for developing advanced AI models.
- Implications for future research in generative AI and machine learning practices.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2511.17844 (cs) [Submitted on 21 Nov 2025 (v1), last revised 23 Feb 2026 (this version, v3)]
Title: Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Authors: Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov
Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
MSC classes: 68U05
ACM classes: I.3.3; I.5.4
Cite as: arXiv:2511.17844 [cs.CV] (arXiv:2511.17844v3 [cs.CV] for this version), https://doi.org/10.48550/arXiv.2511.17844
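The abstract's core idea, adding a new control to a frozen pretrained model by fine-tuning a small adapter on a handful of crude synthetic examples, can be illustrated with a deliberately simplified toy sketch. This is not the paper's actual method or architecture: the "base model", the "aperture" control, the data generator, and all values below are hypothetical stand-ins, chosen only to show that a sparse, noisy synthetic dataset can still pin down a control direction when the rest of the model stays frozen.

```python
# Toy sketch (hypothetical, not the authors' implementation): learn a new
# "aperture" control for a frozen base model by fitting only a small adapter
# on sparse, noisy synthetic data.
import random

random.seed(0)
DIM = 4  # toy feature dimension

def base_model(prompt_vec):
    # Frozen pretrained "generator"; identity mapping in this toy.
    return list(prompt_vec)

# Adapter: the only trainable part. It shifts the base output along a
# learned direction, scaled by the aperture control value.
w = [0.0] * DIM

def generate(prompt_vec, aperture):
    out = base_model(prompt_vec)
    return [o + aperture * wi for o, wi in zip(out, w)]

# Sparse, low-quality synthetic data: targets follow a "true" control
# direction corrupted by noise (standing in for crude synthetic renders).
TRUE_DIR = [1.0, -0.5, 0.25, 0.0]

def synthetic_sample():
    prompt = [random.gauss(0, 1) for _ in range(DIM)]
    aperture = random.uniform(0, 2)
    target = [p + aperture * t + random.gauss(0, 0.1)  # noisy target
              for p, t in zip(prompt, TRUE_DIR)]
    return prompt, aperture, target

data = [synthetic_sample() for _ in range(16)]  # only 16 examples

# Plain SGD on squared error; gradient w.r.t. w_i is 2*aperture*(pred-target).
LR = 0.05
for _ in range(200):
    for prompt, aperture, target in data:
        pred = generate(prompt, aperture)
        for i in range(DIM):
            w[i] -= LR * 2 * aperture * (pred[i] - target[i])

# The adapter recovers the control direction despite the noisy, tiny dataset.
err = max(abs(wi - ti) for wi, ti in zip(w, TRUE_DIR))
```

Because the base model is frozen and the adapter has very few degrees of freedom, even sixteen noisy examples suffice to identify the control direction, loosely echoing the paper's claim that simple synthetic data is enough to teach a new control.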