[2602.23358] A Dataset is Worth 1 MB

arXiv - Machine Learning

Summary

The paper presents PLADA (Pseudo-Labels as Data), a method that transmits a dataset as class labels over a preloaded reference pool instead of raw pixels, reducing the payload to under 1 MB while maintaining classification accuracy.

Why It Matters

As machine learning models become increasingly complex, the need for efficient data transmission grows. PLADA addresses the challenge of high communication costs in dataset distribution, making it easier for agents to train task-specific models without the burden of large data transfers. This innovation could enhance model training efficiency and accessibility, especially in resource-constrained environments.

Key Takeaways

  • PLADA eliminates pixel transmission entirely by sending only class labels for images in a reference dataset agents already hold.
  • The method retains high classification accuracy with a payload under 1 MB (a rough size estimate follows this list).
  • A pruning mechanism filters reference datasets to improve training efficiency.
  • Experiments demonstrate effectiveness across 10 diverse datasets.
  • This approach could revolutionize dataset serving in machine learning.
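To make the sub-1 MB figure concrete, here is a back-of-the-envelope payload estimate under illustrative assumptions (the pool size matches ImageNet-1K, but the bit-level encoding, target task size, and selection count below are assumptions, not the paper's):

```python
# Hypothetical payload estimate for label-only dataset transmission.
# Illustrative numbers only, not PLADA's actual encoding.
import math

reference_pool_size = 1_281_167   # ImageNet-1K training images
num_target_classes = 100          # assumed target task
num_selected = 200_000            # images kept after pruning (assumption)

bits_per_index = math.ceil(math.log2(reference_pool_size))  # 21 bits
bits_per_label = math.ceil(math.log2(num_target_classes))   # 7 bits

payload_bytes = num_selected * (bits_per_index + bits_per_label) / 8
print(f"~{payload_bytes / 1e6:.2f} MB")  # ~0.70 MB, under the 1 MB budget
```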

Computer Science > Machine Learning
arXiv:2602.23358 (cs) [Submitted on 26 Feb 2026]

Title: A Dataset is Worth 1 MB
Authors: Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

Abstract: A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets...
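The abstract does not spell out the selection procedure, but a minimal sketch of the pruning idea might look like the following, assuming unit-normalized embeddings from some shared encoder for both the reference pool and a handful of labeled target exemplars (all function and variable names here are hypothetical, not PLADA's actual API):

```python
# Minimal sketch of reference-pool pruning via semantic similarity.
# Illustrative only, not PLADA's published algorithm. Assumes embeddings
# are unit-normalized so a dot product equals cosine similarity.
import numpy as np

def select_pseudo_labels(ref_emb: np.ndarray,       # (N, d) reference-pool embeddings
                         target_emb: np.ndarray,    # (M, d) target exemplar embeddings
                         target_labels: np.ndarray, # (M,) exemplar class ids
                         per_class: int = 2_000):
    """Return (reference_index, pseudo_label) pairs for the most relevant images."""
    classes = np.unique(target_labels)
    # One prototype per target class: mean exemplar embedding, re-normalized.
    protos = np.stack([target_emb[target_labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)

    sims = ref_emb @ protos.T  # (N, C) cosine similarity to each class prototype
    pairs = []
    for j, c in enumerate(classes):
        top = np.argsort(-sims[:, j])[:per_class]  # most similar reference images
        pairs.extend((int(i), int(c)) for i in top)
    return pairs
```

A client holding the same reference pool would then train on the referenced images with these pseudo-labels, so the only network traffic is the list of (index, label) pairs.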

