[2602.14301] DeepFusion: Accelerating MoE Training via Federated Knowledge Distillation from Heterogeneous Edge Devices
Summary
DeepFusion introduces a scalable framework for federated training of Mixture-of-Experts (MoE) models, leveraging knowledge distillation from heterogeneous edge devices to enhance performance while reducing communication costs.
Why It Matters
As large language models (LLMs) become increasingly prevalent, the need for efficient training methods that respect data privacy is critical. DeepFusion addresses the challenges of traditional federated learning by enabling resource-constrained devices to contribute to MoE training, thus broadening access to advanced AI capabilities while maintaining privacy.
Key Takeaways
- DeepFusion enables federated training of MoE models from diverse edge devices.
- The framework reduces communication costs by up to 71% compared to traditional methods.
- It introduces a novel View-Aligned Attention module to align predictions across different architectures.
- Experiments show performance close to centralized training while preserving data privacy.
- DeepFusion is particularly beneficial for resource-constrained devices in real-world applications.
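The core mechanism the takeaways describe, fusing knowledge from heterogeneous on-device models into a single global model via distillation, can be sketched in miniature. The paper's actual aggregation and loss are not specified in this summary, so the sketch below makes standard assumptions: each device contributes logits over a shared proxy batch, the server averages their temperature-softened distributions, and the global model is trained to match that fused distribution with a KL-divergence loss. Function names (`fuse_device_knowledge`, `distill_loss`) and the temperature value are illustrative, not from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, numerically stabilized."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_device_knowledge(device_logits, temperature=2.0):
    """Fuse heterogeneous devices' predictions on a shared proxy batch.

    device_logits: list of (batch, classes) arrays, one per device.
    Returns the averaged soft-label distribution, shape (batch, classes).
    """
    probs = [softmax(l, temperature) for l in device_logits]
    return np.mean(probs, axis=0)

def distill_loss(global_logits, teacher_probs, temperature=2.0):
    """KL divergence from the fused teacher to the global model's predictions."""
    q = softmax(global_logits, temperature)
    p = teacher_probs
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean())
```

Only soft labels (a few floats per example) cross the network in this scheme, rather than full MoE weights, which is consistent with the communication savings the takeaways report.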
Computer Science > Machine Learning — arXiv:2602.14301 (cs), submitted on 15 Feb 2026
Title: DeepFusion: Accelerating MoE Training via Federated Knowledge Distillation from Heterogeneous Edge Devices
Authors: Songyuan Li, Jia Hu, Ahmed M. Abdelmoniem, Geyong Min, Haojun Huang, Jiwei Huang
Abstract: Recent Mixture-of-Experts (MoE)-based large language models (LLMs) such as Qwen-MoE and DeepSeek-MoE are transforming generative AI in natural language processing. However, these models require vast and diverse training data. Federated learning (FL) addresses this challenge by leveraging private data from heterogeneous edge devices for privacy-preserving MoE training. Nonetheless, traditional FL approaches require devices to host local MoE models, which is impractical for resource-constrained devices due to large model sizes. To address this, we propose DeepFusion, the first scalable federated MoE training framework that enables the fusion of heterogeneous on-device LLM knowledge via federated knowledge distillation, yielding a knowledge-abundant global MoE model. Specifically, DeepFusion enables each device to independently configure and train an on-device LLM tailored to its own needs and hardware limitations. Furthermore, we propose a novel View-Aligned Attention (VAA) module tha...
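The abstract is truncated before it describes the View-Aligned Attention (VAA) module, so the paper's actual formulation is unknown here. As a purely illustrative sketch, one common way to align representations from architectures with different hidden widths is to project each device's hidden states into a shared width and let a set of shared-view queries attend over them; the sketch below assumes exactly that. The function name, the learnable `queries`, and the per-device `proj` matrix are all assumptions for illustration, not the paper's design.

```python
import numpy as np

def view_aligned_attention(queries, device_hidden, proj):
    """Align one device's hidden states to a shared view via single-head attention.

    queries:       (q, d) shared-view query vectors (assumed learnable, server-side)
    device_hidden: (n, h) hidden states from a device with hidden width h
    proj:          (h, d) per-device projection mapping width h to shared width d
    Returns a (q, d) representation in the shared view.
    """
    kv = device_hidden @ proj                         # (n, d) keys/values in shared width
    scores = queries @ kv.T / np.sqrt(kv.shape[-1])   # (q, n) scaled dot-product scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv                               # (q, d) aligned output
```

Because every device's output lands in the same (q, d) shape regardless of its native hidden width, the server can compare and fuse predictions across heterogeneous architectures, which is the role the summary attributes to VAA.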