[2603.26786] A Step Toward Federated Pretraining of Multimodal Large Language Models
Computer Science > Machine Learning

arXiv:2603.26786 (cs) [Submitted on 25 Mar 2026]

Title: A Step Toward Federated Pretraining of Multimodal Large Language Models

Authors: Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu

Abstract: The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a share...
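The abstract's Fed-MA setup can be made concrete with a minimal sketch: each client holds a frozen vision encoder and frozen LLM, trains only the cross-modal projector on its private data, and the server aggregates the projectors. The sketch below uses plain weighted FedAvg for the aggregation step as a naive baseline (the paper's Canonical Reliability-Aware Aggregation is not described in enough detail here to reproduce), and all module and function names are illustrative assumptions, not the authors' code.

```python
# Sketch of the Fed-MA training loop described in the abstract, assuming
# PyTorch. The vision encoder and LLM are frozen; only the projector is
# trained locally and averaged across clients. Aggregation shown is plain
# FedAvg, NOT the paper's Canonical Reliability-Aware Aggregation.
import copy
import torch.nn as nn


class MLLMClient(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, token_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        # The only trainable component: maps vision features into the
        # LLM's token embedding space.
        self.projector = nn.Linear(vision_dim, token_dim)
        # Freeze both large backbones, per the Fed-MA paradigm.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False


def local_step(client: MLLMClient, images, text_tokens, optimizer):
    """One local SGD step on a client's private image-text pair.
    Assumes the frozen LLM returns an alignment loss given projected
    vision tokens and text tokens (hypothetical interface)."""
    vision_feats = client.vision_encoder(images)    # frozen
    vision_tokens = client.projector(vision_feats)  # trainable
    loss = client.llm(vision_tokens, text_tokens)   # frozen, returns loss
    optimizer.zero_grad()
    loss.backward()  # gradients flow into the projector only
    optimizer.step()
    return loss.item()


def aggregate_projectors(clients, weights):
    """Server side: weighted average of the clients' projector parameters.
    This naive averaging is exactly where the paper's challenge (i),
    parameter interference, arises."""
    avg = copy.deepcopy(clients[0].projector.state_dict())
    for key in avg:
        avg[key] = sum(w * c.projector.state_dict()[key]
                       for c, w in zip(clients, weights))
    for c in clients:
        c.projector.load_state_dict(avg)
```

Freezing the backbones keeps the per-round communication cost down to the projector's parameters alone, which is what makes the "lightweight pre-training paradigm" claim plausible at scale.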