[2604.04199] Which Leakage Types Matter?
Computer Science > Machine Learning

arXiv:2604.04199 (cs)

[Submitted on 5 Apr 2026]

Title: Which Leakage Types Matter?
Authors: Simon Roth

Abstract: Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on the full dataset) is negligible: all nine conditions produce $|\Delta\text{AUC}| \leq 0.005$. Class II (selection: test-set peeking, seed cherry-picking) is substantial: roughly 90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: $d_z = 0.37$ (Naive Bayes) to $1.11$ (Decision Tree). Class IV (temporal boundary) is invisible under random cross-validation. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2604.04199 [cs.LG] (or arXiv:2604.04199v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.04199 (arXiv-issued DOI via DataCite, pending registration)

Submission history
From: Simon Roth [view email]
[v1] Sun, 5 Apr 2026 17:47:46 UTC (235 KB)
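The Class I counterfactual described in the abstract can be sketched as a minimal paired comparison: fit a scaler on the full dataset (leaky) versus on the training fold only (clean), and measure the AUC difference. The dataset, model, and split below are illustrative assumptions, not the paper's exact experimental setup.

```python
# Sketch of Class I "estimation" leakage: does fitting a StandardScaler on
# the full dataset (test rows included) change held-out AUC versus fitting
# on the training fold only? Synthetic data and logistic regression are
# placeholder choices for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def auc_with_scaler(fit_on_full: bool) -> float:
    scaler = StandardScaler()
    if fit_on_full:
        scaler.fit(X)       # leaky: test rows influence the mean/std estimates
    else:
        scaler.fit(X_tr)    # clean: statistics come from the training fold only
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])

delta = auc_with_scaler(True) - auc_with_scaler(False)
print(f"Delta AUC from scaler leakage: {delta:+.4f}")
```

On a single synthetic split like this, the difference is typically tiny, consistent with the paper's claim that estimation leakage is negligible; the paper's $|\Delta\text{AUC}| \leq 0.005$ figure is an aggregate over its nine conditions, not something this one-off sketch reproduces.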