[2602.21788] DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
Summary
The paper presents Dynamic Hybrid Parallelism (DHP), a new strategy for efficiently scaling the training of Multimodal Large Language Models (MLLMs) by adapting communication groups and parallelism degrees to improve hardware utilization.
Why It Matters
As MLLMs become increasingly important in AI applications, optimizing their training efficiency is crucial. DHP addresses common issues in existing frameworks, such as load imbalance and communication overhead, making it relevant for researchers and practitioners in machine learning and AI infrastructure.
Key Takeaways
- DHP adapts communication and parallelism dynamically to enhance training efficiency.
- The method outperforms existing frameworks like Megatron-LM and DeepSpeed.
- Achieves up to 1.36x speedup in training throughput with near-linear scaling.
- Utilizes a polynomial-time algorithm to generate near-optimal parallelism strategies with only millisecond-level overhead per training batch.
- Maintains high hardware efficiency despite extreme data variability.
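To make the idea of per-batch strategy selection concrete, here is a toy sketch of how a polynomial-time search over parallelism degrees might look. This is an illustrative simplification, not the paper's actual algorithm: the cost model, the `seq_lens` input, and the `comm_cost` weight are all assumptions. It enumerates every divisor pair of the device count (so non-power-of-two degrees are allowed, as DHP generalizes) and picks the pair minimizing a simple compute-plus-communication estimate.

```python
def divisors(n):
    """All positive divisors of n (O(n) here; fine for device counts)."""
    return [d for d in range(1, n + 1) if n % d == 0]

def choose_strategy(world_size, seq_lens, comm_cost=0.1):
    """Pick (data_parallel, context_parallel) degrees for one batch.

    Toy cost model (illustrative only, NOT the DHP paper's algorithm):
      - compute cost ~ the largest per-replica token load after sharding
      - communication cost grows with the context-parallel degree
    Any divisor pair of world_size is considered, including
    non-power-of-two degrees such as 3 or 6.
    """
    best, best_cost = None, float("inf")
    for cp in divisors(world_size):   # context-parallel degree
        dp = world_size // cp         # data-parallel degree
        # Greedily pack sequences onto dp replicas (longest first);
        # each sequence is sharded across cp devices.
        loads = [0.0] * dp
        for s in sorted(seq_lens, reverse=True):
            loads[min(range(dp), key=loads.__getitem__)] += s / cp
        cost = max(loads) + comm_cost * (cp - 1) * max(seq_lens) / cp
        if cost < best_cost:
            best, best_cost = (dp, cp), cost
    return best, best_cost
```

Because the search is just a loop over divisor pairs with a greedy packing step, its overhead is tiny relative to a training step, which is the property the paper's millisecond-level claim relies on; the real system would of course use a far more accurate cost model.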
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2602.21788 (cs) [Submitted on 25 Feb 2026]
Authors: Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li
Abstract: Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize the non-power-of-two parallelism degrees and develop a polynomial-time algorithm to generate near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP is able to maintain high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36x speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU ...