[2505.21574] Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models
Computer Science > Computer Vision and Pattern Recognition
arXiv:2505.21574 (cs)
[Submitted on 27 May 2025 (v1), last revised 4 Mar 2026 (this version, v3)]

Title: Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models
Authors: Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman

Abstract: Synthetically augmenting training datasets with diffusion models has become an effective strategy for improving the generalization of image classifiers. However, existing approaches typically increase dataset size by 10-30x and struggle to ensure generation diversity, incurring substantial computational overhead. In this work, we introduce TADA (TArgeted Diffusion Augmentation), a principled framework that selectively augments only those examples that are not learned early in training, using faithful synthetic images that preserve semantic features while varying noise. We show that augmenting only this targeted subset consistently outperforms augmenting the entire dataset. Through a theoretical analysis of a two-layer CNN, we prove that TADA improves generalization by promoting homogeneity in feature-learning speed without amplifying noise. Extensive experiments demonstrate that by augmenting only 30-40% of the training data, TADA improves generalization by up to 2.8% across diverse architectures...
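The core idea of targeting examples "not learned early in training" can be sketched as follows. This is a hypothetical illustration, not the authors' code: we assume a per-epoch record of which examples were classified correctly, and flag the slow-to-learn ones (here via an assumed `threshold` on early-epoch accuracy) as the subset to augment.

```python
# Hypothetical sketch of TADA-style targeted selection (function name,
# threshold, and epoch window are assumptions, not from the paper).
import numpy as np

def select_targets(correct_history: np.ndarray, early_epochs: int = 5,
                   threshold: float = 0.8) -> np.ndarray:
    """correct_history: (num_epochs, num_examples) boolean matrix recording
    whether each example was classified correctly at each epoch.
    Returns indices of examples learned too slowly in early training,
    i.e. the targeted subset to augment with synthetic images."""
    early = correct_history[:early_epochs]        # restrict to early epochs
    learned_frac = early.mean(axis=0)             # per-example early accuracy
    return np.where(learned_frac < threshold)[0]  # slow-to-learn examples

# Toy usage: 4 examples over 5 epochs; example 0 is learned immediately,
# example 3 is never learned early.
hist = np.array([[1, 0, 1, 0],
                 [1, 1, 0, 0],
                 [1, 1, 1, 0],
                 [1, 0, 1, 0],
                 [1, 1, 1, 0]], dtype=bool)
targets = select_targets(hist)
print(targets.tolist())  # examples 1 and 3 fall below the 0.8 threshold
```

Only the returned indices would then be passed to a diffusion model for augmentation, which is how the paper's reported 30-40% augmentation budget would arise in practice.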