[2508.10836] SoK: Data Minimization in Machine Learning
Summary
The paper presents the first systematization of knowledge (SoK) on data minimization in machine learning (DMML). It explains why the principle matters for regulatory compliance and offers practitioners a framework for applying it.
Why It Matters
Data minimization is crucial for compliance with regulations like GDPR and CPRA, especially in machine learning, where large datasets are common. This paper provides a structured approach to understanding and applying data minimization principles, helping to bridge gaps in current research and practice.
Key Takeaways
- Data minimization is essential for regulatory compliance in ML applications.
- The paper introduces a unified framework for understanding data minimization in ML.
- It highlights that existing ML privacy and security research often addresses concerns relevant to data minimization without explicitly acknowledging the connection.
- Practitioners can use the framework to identify relevant techniques and trade-offs.
- The work aims to clarify terminology and metrics related to data minimization.
Computer Science > Machine Learning
arXiv:2508.10836 (cs)
[Submitted on 14 Aug 2025 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: SoK: Data Minimization in Machine Learning
Authors: Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski
Abstract: Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, we present the first systematization of knowledge (SoK) ...