[2602.23358] A Dataset is Worth 1 MB
Summary
The paper presents PLADA (Pseudo-Labels as Data), a method for efficient dataset transmission in machine learning that replaces pixel transfer with class labels over a shared reference dataset, reducing the payload to under 1 MB while maintaining accuracy.
Why It Matters
As machine learning models become increasingly complex, the need for efficient data transmission grows. PLADA addresses the challenge of high communication costs in dataset distribution, making it easier for agents to train task-specific models without the burden of large data transfers. This innovation could enhance model training efficiency and accessibility, especially in resource-constrained environments.
Key Takeaways
- PLADA eliminates pixel transmission entirely by sending only class labels for images in a preloaded reference dataset.
- The method retains high classification accuracy with a payload under 1 MB.
- A pruning mechanism filters reference datasets to improve training efficiency.
- Experiments demonstrate effectiveness across 10 diverse datasets.
- This approach could substantially cut the communication cost of dataset serving in machine learning.
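To see why a label-only payload plausibly stays under 1 MB, consider transmitting only (reference-image index, class label) pairs for the pruned subset. The counts and byte sizes below are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope payload estimate for label-only dataset transmission.
# All numbers are illustrative assumptions, not values from the paper.

def payload_bytes(num_images, index_bytes=4, label_bytes=2):
    """Bytes needed to send (reference-image index, class label) pairs."""
    return num_images * (index_bytes + label_bytes)

# Suppose pruning keeps 150,000 reference images for the target task.
kept = 150_000
size_mb = payload_bytes(kept) / (1024 ** 2)
print(f"{size_mb:.2f} MB")  # → 0.86 MB, under the 1 MB budget
```

Even a fairly large pruned subset fits in a fraction of the bandwidth that raw pixels (typically tens of kilobytes per image) would require.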
Computer Science > Machine Learning
arXiv:2602.23358 (cs) [Submitted on 26 Feb 2026]
Title: A Dataset is Worth 1 MB
Authors: Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen
Abstract: A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets...
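The abstract describes a pruning mechanism that retains only the most semantically relevant reference images for the target task, but this excerpt does not specify how relevance is scored. One plausible sketch, assuming relevance is measured as cosine similarity between reference-image embeddings and per-class prototype embeddings of the target task (the embedding sizes and counts here are hypothetical):

```python
import numpy as np

def prune_reference(ref_embeds, class_prototypes, keep):
    """Keep the `keep` reference images whose embeddings are most similar
    (by cosine similarity) to any target-class prototype.
    Illustrative sketch only; not the paper's exact mechanism."""
    # Normalize rows so that dot products equal cosine similarities.
    ref = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    proto = class_prototypes / np.linalg.norm(class_prototypes, axis=1, keepdims=True)
    # For each reference image, similarity to its best-matching target class.
    best_sim = (ref @ proto.T).max(axis=1)
    # Indices of the top-`keep` most relevant reference images.
    return np.argsort(-best_sim)[:keep]

rng = np.random.default_rng(0)
ref_embeds = rng.normal(size=(1000, 64))      # hypothetical reference embeddings
class_prototypes = rng.normal(size=(10, 64))  # hypothetical class prototypes
kept = prune_reference(ref_embeds, class_prototypes, keep=100)
print(kept.shape)  # → (100,)
```

Selecting by best-matching class (rather than averaging over classes) keeps images that are strongly relevant to at least one target class, which matches the stated goal of filtering out the distribution mismatch between reference and target data.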