[2509.25678] Massively Multimodal Foundation Models: A Framework for Capturing Interactions with Specialized Mixture-of-Experts
Computer Science > Machine Learning

arXiv:2509.25678 (cs)

[Submitted on 30 Sep 2025 (v1), last revised 28 Feb 2026 (this version, v4)]

Title: Massively Multimodal Foundation Models: A Framework for Capturing Interactions with Specialized Mixture-of-Experts

Authors: Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria

Abstract: Modern applications increasingly involve many heterogeneous input streams, such as clinical sensors, wearable device data, imaging, and text, each with distinct measurement models, sampling rates, and noise characteristics. We define this as the massively multimodal setting, where each sensor constitutes a separate modality. As modality counts grow, capturing their complex, time-varying interactions, such as delayed physiological cascades between sensors, becomes essential yet challenging. Mixture-of-Experts (MoE) architectures are naturally suited to this setting, since their sparse routing mechanism enables efficient scaling across many modalities. However, existing MoE architectures route tokens based on similarity alone, overlooking the rich temporal dependencies across modalities: this prevents the model from capturing delayed cross-modal effects, leading to suboptimal expert specialization and reduced accuracy. We propose a fra...
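For context on the baseline the abstract critiques, below is a minimal PyTorch sketch of similarity-based top-k MoE routing: each token is dispatched to a few experts purely by gate scores computed from its current embedding, with no notion of temporal or cross-modal dependencies. This is standard sparse gating, not the paper's proposed framework; all names (SimilarityRouterMoE, d_model, n_experts, top_k) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityRouterMoE(nn.Module):
    """Sketch of conventional top-k MoE routing (not the paper's method)."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gate scores each token against each expert by a learned similarity.
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model). Tokens may come from many modalities,
        # but the router only sees each token's current embedding, so delayed
        # cross-modal effects are invisible to the routing decision.
        scores = self.gate(x)                           # (B, T, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # sparse: keep top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize kept scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 2 sequences of 16 tokens, each token a 64-dim embedding.
moe = SimilarityRouterMoE(d_model=64, n_experts=8, top_k=2)
y = moe(torch.randn(2, 16, 64))  # y has shape (2, 16, 64)
```

The abstract's point is that this gate conditions only on per-token similarity; a router aware of temporal structure across modalities could, by contrast, specialize experts for delayed cross-modal interactions.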