[2501.07575] Dataset Distillation via Committee Voting
Summary
The paper presents Committee Voting for Dataset Distillation (CV-DD), a novel method that leverages the collective predictions of multiple models to synthesize a compact, high-quality dataset for efficient model training.
Why It Matters
As machine learning models grow increasingly complex, the need for efficient training datasets becomes critical. This research introduces a method that not only improves dataset quality but also addresses issues of bias and overfitting, making it highly relevant for practitioners and researchers in AI and machine learning.
Key Takeaways
- CV-DD utilizes committee voting to synthesize high-quality datasets.
- The method reduces model-specific biases and enhances generalization.
- Extensive experiments show CV-DD outperforms existing distillation techniques.
- The approach improves robustness and alleviates overfitting.
- Code for the method is publicly available for further exploration.
arXiv:2501.07575 (cs) — Computer Science > Computer Vision and Pattern Recognition
Submitted on 13 Jan 2025 (v1); last revised 14 Feb 2026 (v2)
Title: Dataset Distillation via Committee Voting
Authors: Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen
Abstract: Dataset distillation aims to synthesize a compact yet representative dataset that preserves the essential characteristics of the original data for efficient model training. Existing methods mainly focus on improving data-synthetic alignment or scaling distillation to large datasets. In this work, we propose Committee Voting for Dataset Distillation (CV-DD), an orthogonal approach that leverages the collective knowledge of multiple models to produce higher-quality distilled data. We first establish a strong baseline that achieves state-of-the-art performance through modern architectural and optimization choices. By integrating distributions and predictions from multiple models and generating high-quality soft labels, our method captures a broader range of data characteristics, reduces model-specific bias and the impact of distribution shifts, and significantly improves generalization. This voting-based strategy enhances diversity and rob...
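The aggregation step the abstract describes — combining the predictions of several committee models into high-quality soft labels — can be sketched as below. This is a minimal illustration, not the paper's exact formulation: the function names, the temperature parameter, and the uniform averaging of class distributions are all assumptions for clarity.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def committee_soft_labels(committee_logits, temperature=1.0):
    """Average the class distributions predicted by a committee of models.

    committee_logits: array-like of shape (num_models, batch, num_classes),
    the logits each committee member assigns to the synthetic samples.
    Returns soft labels of shape (batch, num_classes).
    """
    logits = np.asarray(committee_logits, dtype=float)
    # Each model votes with a full probability distribution; the mean
    # smooths out any single model's idiosyncratic bias.
    probs = softmax(logits / temperature)
    return probs.mean(axis=0)
```

For example, if two models disagree sharply on a sample, the averaged soft label lands near the uniform distribution rather than committing to either model's bias — the intuition behind the voting strategy's reduced model-specific bias.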