[2602.13697] No Need to Train Your RDB Foundation Model
Summary
The paper presents an approach to predictive modeling over relational databases (RDBs) that avoids retraining a model for each new prediction target, leveraging in-context learning (ICL) and scalable SQL primitives.
Why It Matters
This research addresses a significant challenge in machine learning: retraining a model for every new prediction target is resource-intensive. By letting existing foundation models operate over RDBs without retraining, the method improves efficiency and accessibility for data-driven applications.
Key Takeaways
- Introduces a method to use RDBs for predictive modeling without retraining.
- Emphasizes the importance of in-context learning (ICL) for handling multiple interrelated tables.
- Demonstrates that encoder expressiveness is maintained without trainable parameters.
- Provides scalable SQL primitives for practical implementation.
- Offers an open-source RDB foundation model capable of robust performance on unseen datasets.
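The paper's central prescription is to compress a variably-sized RDB neighborhood (e.g., all orders belonging to a customer) into a fixed-length ICL sample by aggregating within each column, where entries share units and roles, rather than across heterogeneous columns. As a minimal sketch of what such a SQL primitive might look like (the schema, table names, and choice of `COUNT`/`AVG` aggregates here are illustrative assumptions, not the paper's actual primitives):

```python
# Hypothetical sketch: compress each customer's variably-sized order
# neighborhood into a fixed-length feature row by aggregating WITHIN
# each column (count, mean), never mixing values across columns.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL,
    n_items INTEGER
);
INSERT INTO customers VALUES (1), (2);
INSERT INTO orders VALUES
    (1, 1, 10.0, 2),
    (2, 1, 30.0, 1),
    (3, 2, 5.0, 4);
""")

# One scalable SQL pass: every customer yields the same fixed-length
# row regardless of how many orders they have, because each column is
# summarized independently.
cur.execute("""
SELECT c.id,
       COUNT(o.id)    AS n_orders,
       AVG(o.amount)  AS mean_amount,
       AVG(o.n_items) AS mean_items
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id
ORDER BY c.id
""")
features = cur.fetchall()
print(features)  # [(1, 2, 20.0, 1.5), (2, 1, 5.0, 4.0)]
```

The fixed-length rows produced this way could then serve as ICL context samples for a decoder; deciding how to weight one column against another is exactly the cross-column judgment the paper argues cannot be made without label information, which is why the aggregation stays within columns.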
Computer Science > Artificial Intelligence
arXiv:2602.13697 (cs)
[Submitted on 14 Feb 2026]
Title: No Need to Train Your RDB Foundation Model
Authors: Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf
Abstract: Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we *avoid retraining* a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained *within* high-dimensional RDB columns, where all entities share units and roles, not *across* columns, where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excl…