[2508.10836] SoK: Data Minimization in Machine Learning

arXiv - Machine Learning

Summary

The paper presents a systematization of knowledge on data minimization in machine learning, addressing its importance in regulatory compliance and offering a framework for practitioners.

Why It Matters

Data minimization is crucial for compliance with regulations like GDPR and CPRA, especially in machine learning, where large datasets are common. This paper provides a structured approach to understanding and applying data minimization principles, helping to bridge gaps in current research and practice.

Key Takeaways

  • Data minimization is essential for regulatory compliance in ML applications.
  • The paper introduces a unified framework for understanding data minimization in ML.
  • It highlights the disconnect between existing ML privacy research and data minimization principles.
  • Practitioners can use the framework to identify relevant techniques and trade-offs.
  • The work aims to clarify terminology and metrics related to data minimization.

Computer Science > Machine Learning

arXiv:2508.10836 (cs)

[Submitted on 14 Aug 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: SoK: Data Minimization in Machine Learning

Authors: Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski

Abstract: Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, we present the first systematization of knowledge (SoK) ...
