[2602.22552] Relatron: Automating Relational Machine Learning over Relational Databases
Summary
The paper presents Relatron, a system that automates relational machine learning over relational databases, addressing the challenges of feature engineering and model selection.
Why It Matters
Relational databases underpin many predictive applications, so reducing the manual effort of machine learning over them has practical value. Relatron's findings on model selection and architecture performance suggest that practitioners should choose between approaches per task rather than defaulting to one, which could lead to more efficient and effective solutions.
Key Takeaways
- Relatron automates the selection between relational deep learning and deep feature synthesis based on task-specific signals.
- Performance of relational deep learning is highly task-dependent and does not consistently outperform deep feature synthesis.
- No single architecture is superior across all tasks, highlighting the importance of task-aware model selection.
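Deep feature synthesis, one of the two approaches compared above, builds features by applying predefined, non-parametric aggregators across table links. The following is a minimal sketch of that idea only, not the paper's implementation; the `orders` table, the `customer_id` key, the aggregator set, and the `synthesize_features` helper are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: one row per order, linked to a customer by foreign key.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]

# Fixed, non-parametric aggregators: nothing here is learned from data.
AGGREGATORS = {"count": len, "sum": sum, "mean": mean, "max": max}


def synthesize_features(child_rows, key, column):
    """Aggregate `column` of the child rows for each parent key."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[column])
    return {
        parent: {f"{name}({column})": fn(values) for name, fn in AGGREGATORS.items()}
        for parent, values in groups.items()
    }


features = synthesize_features(orders, "customer_id", "amount")
```

The resulting per-customer features could be fed to any tabular model. A relational deep learning method would instead learn the aggregation through message passing, which is why the two approaches can rank differently from task to task.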
Computer Science > Machine Learning
arXiv:2602.22552 (cs)
[Submitted on 26 Feb 2026]
Title: Relatron: Automating Relational Machine Learning over Relational Databases
Authors: Zhikai Chen, Han Xie, Jian Zhang, Jiliang Tang, Xiang Song, Huzefa Rangwala
Abstract: Predictive modeling over relational databases (RDBs) powers applications, yet remains challenging due to capturing both cross-table dependencies and complex feature interactions. Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite performance gains, the comparative advantages of RDL over DFS and the design principles for selecting effective architectures remain poorly understood. We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) validation accuracy is an unreliable guide for architecture choice. This search yields a model performance bank that links architectur...