[2602.20223] MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Summary
The paper introduces MultiModalPFN, an extension of TabPFN designed for multimodal tabular learning, effectively integrating diverse data types like text and images.
Why It Matters
TabPFN, a foundation model for tabular data, struggles to integrate heterogeneous modalities such as images and text. That limits its use in domains like healthcare and marketing, where such data is common. MultiModalPFN targets this gap by letting tabular and non-tabular inputs be processed together in one model.
Key Takeaways
- MultiModalPFN extends TabPFN to unify tabular and non-tabular data.
- Its modality projectors use two new components, a multi-head gated MLP and a cross-attention pooler, to turn non-tabular embeddings into tabular-compatible tokens.
- In experiments on medical and general-purpose multimodal datasets, MultiModalPFN (MMPFN) consistently outperforms competitive state-of-the-art methods.
- The framework is scalable and effective for heterogeneous data learning.
- Source code is available for further exploration and application.
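To make the "multi-head gated MLP" takeaway more concrete, here is a minimal sketch of what such a projector could look like. This summary does not describe the layer's internals, so everything here is an assumption for illustration: the class name `MultiHeadGatedMLP`, the sigmoid gate, the choice of one output token per head, and all dimensions are hypothetical and are not taken from the authors' code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a (rows x cols) matrix, stored as nested lists, by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MultiHeadGatedMLP:
    """Hypothetical projector: each head maps one embedding to one token.

    Assumption: a head computes a linear "value" and a sigmoid "gate" from
    the same input and multiplies them elementwise.
    """

    def __init__(self, in_dim, token_dim, num_heads, seed=0):
        rng = random.Random(seed)
        self.heads = [
            (random_matrix(token_dim, in_dim, rng),   # value weights
             random_matrix(token_dim, in_dim, rng))   # gate weights
            for _ in range(num_heads)
        ]

    def __call__(self, embedding):
        tokens = []
        for value_weights, gate_weights in self.heads:
            value = matvec(value_weights, embedding)
            gate = [sigmoid(g) for g in matvec(gate_weights, embedding)]
            # The gate (between 0 and 1) scales each value component.
            tokens.append([v * g for v, g in zip(value, gate)])
        return tokens


# One 8-dimensional embedding (e.g. from an image or text encoder)
# becomes three 4-dimensional tokens.
projector = MultiHeadGatedMLP(in_dim=8, token_dim=4, num_heads=3)
embedding = [0.1 * i for i in range(8)]
tokens = projector(embedding)
print(len(tokens), len(tokens[0]))
```

In this sketch the number of heads sets how many tokens a single non-tabular embedding contributes, which is one simple way to give an image or text field more than one "column" of representation next to the tabular features.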
Computer Science > Machine Learning
arXiv:2602.20223 (cs)
[Submitted on 23 Feb 2026]
Title: MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Authors: Wall Kim, Chaeyoung Song, Hanul Kim
Abstract: Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features...
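The abstract names a cross-attention pooler but does not spell out its design. The sketch below shows the generic technique only: a fixed set of query vectors attends over a variable number of encoder embeddings and returns a fixed number of tokens. The function name `cross_attention_pool`, the random queries standing in for learned ones, and the omission of key/value projections are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries, embeddings):
    """Pool N embeddings into len(queries) tokens via scaled dot-product attention.

    queries:    K vectors of size d (learned in a real model; random here)
    embeddings: N vectors of size d from a non-tabular encoder
    Returns (pooled, weights): K pooled vectors and the K x N attention weights.
    """
    dim = len(queries[0])
    pooled, weights = [], []
    for query in queries:
        scores = [
            sum(q * e for q, e in zip(query, emb)) / math.sqrt(dim)
            for emb in embeddings
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Each pooled token is an attention-weighted average of the inputs.
        pooled.append(
            [sum(a * emb[i] for a, emb in zip(attn, embeddings)) for i in range(dim)]
        )
    return pooled, weights


rng = random.Random(0)
dim, num_queries = 4, 2
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

# Inputs of different lengths, e.g. 5 text tokens versus 50 image patches.
short_input = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
long_input = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

tokens_short, attn_short = cross_attention_pool(queries, short_input)
tokens_long, attn_long = cross_attention_pool(queries, long_input)
print(len(tokens_short), len(tokens_long))
```

Both inputs yield the same number of pooled tokens regardless of length. A fixed output count is one way a pooler could keep long non-tabular inputs from outnumbering tabular tokens; whether that is the mechanism the authors intend by "attention imbalance" is not stated in the abstract.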