[2603.10742] A Grammar of Machine Learning Workflows
arXiv:2603.10742 (cs)
[Submitted on 11 Mar 2026 (v1), last revised 5 Apr 2026 (this version, v3)]

Title: A Grammar of Machine Learning Workflows
Authors: Simon Roth

Abstract: Data leakage has been identified in 648 published machine learning papers across 30 scientific fields. The knowledge to prevent it exists; the tools do not enforce it. This paper presents a grammar - eight typed primitives, a directed acyclic graph, and four hard constraints - that makes the most damaging leakage types structurally unrepresentable. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary in an ML framework, backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.

Subjects: Machine Learning (cs.LG)
ACM classes: I.2.6; D.2.4
Cite as: arXiv:2603.10742 [cs.LG] (or arXiv:2603.10742v3 [cs.LG] for this version)
arXiv-issued DOI via DataCite: https://doi.org/10.48550/arXiv.2603.10742
Related DOI: https://doi.org/10.5281/zenodo.19406355

Submission history
From: Simon Roth
[v1] Wed, 11 Mar 2026 13:15:33 UTC (118 KB)
[v2] Sat, 14 Ma...
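The abstract's "terminal assessment gate" can be illustrated with a small sketch. This is not the paper's API or either reference implementation; the names `GatedSplit`, `evaluate`, `assess`, and `AssessmentGateError` are hypothetical, chosen only to show what a call-time-enforced evaluate/assess boundary might look like: validation data can be scored repeatedly, while the held-out test partition can be scored exactly once.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

# Hypothetical types for illustration: a row is (feature, label),
# and a model is any callable mapping a feature to a predicted label.
Row = tuple[float, int]
Model = Callable[[float], int]


class AssessmentGateError(RuntimeError):
    """Raised when the held-out test partition is scored a second time."""


def _accuracy(model: Model, rows: Sequence[Row]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


@dataclass
class GatedSplit:
    """Assumed container: validation data is reusable, test data is single-use."""

    valid: Sequence[Row]
    test: Sequence[Row]
    _assessed: bool = field(default=False, init=False)

    def evaluate(self, model: Model) -> float:
        """Score on validation data; may be called any number of times."""
        return _accuracy(model, self.valid)

    def assess(self, model: Model) -> float:
        """Score on test data; the gate closes after the first call."""
        if self._assessed:
            raise AssessmentGateError("test partition has already been assessed")
        self._assessed = True
        return _accuracy(model, self.test)


if __name__ == "__main__":
    split = GatedSplit(
        valid=[(0.1, 0), (0.9, 1), (0.4, 0), (0.6, 1)],
        test=[(0.2, 0), (0.8, 1)],
    )
    threshold_model: Model = lambda x: int(x > 0.5)

    print(split.evaluate(threshold_model))  # iterate freely on validation data
    print(split.assess(threshold_model))    # single terminal assessment
    try:
        split.assess(threshold_model)       # second attempt is rejected at call time
    except AssessmentGateError as err:
        print(f"blocked: {err}")
```

The check happens when `assess` is called rather than being left to convention, which is the property the abstract emphasizes. How the paper's grammar actually represents and enforces the gate is defined by its specification, not by this sketch.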