[2403.16125] Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2403.16125 (cs)
[Submitted on 24 Mar 2024 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design

Authors: Chunyu Xue, Weihao Cui, Quan Chen, Chen Chen, Han Zhao, Shulai Zhang, Linmei Wang, Yan Li, Limin Xiao, Weifeng Zhang, Jing Yang, Bingsheng He, Minyi Guo

Abstract: Efficiently training large-scale models (LMs) in GPU clusters involves two separate avenues: inter-job dynamic scheduling and intra-job adaptive parallelism (AP). However, existing dynamic schedulers struggle with large-model scheduling due to the mismatch between static parallelism (SP)-aware scheduling and AP-based execution, leading to cluster inefficiencies such as degraded throughput and prolonged job queuing. This paper presents Arena, a large-model training system that co-designs dynamic scheduling and adaptive parallelism to achieve high cluster efficiency. To reduce scheduling costs while improving decision quality, Arena designs low-cost, disaggregated profiling and AP-tailored, load-aware performance estimation, and unifies them by sharding the joint scheduling-parallelism optimization space via a grid abstraction. Building on this, Arena...
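
The abstract describes sharding the joint scheduling-parallelism optimization space via a grid abstraction, paired with load-aware performance estimation. As a rough illustration of that general idea only (the paper's actual design, data structures, and cost model are not given in the abstract), the following Python sketch keys a grid by total GPU count, fills each cell with candidate parallelism plans, and uses a toy load-aware estimator to rank the plans that fit the currently free GPUs. All names and the throughput model here are assumptions for illustration.

```python
# Hypothetical sketch of a grid abstraction over the joint
# scheduling-parallelism space. Names and the cost model are
# illustrative assumptions, not Arena's actual API or estimator.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ParallelismPlan:
    data_parallel: int
    tensor_parallel: int
    pipeline_parallel: int

    @property
    def gpus(self) -> int:
        return self.data_parallel * self.tensor_parallel * self.pipeline_parallel


def estimate_throughput(plan: ParallelismPlan, cluster_load: float) -> float:
    """Toy load-aware estimate: more GPUs help, but communication-heavy
    parallelism degrees and a busier cluster discount the gain."""
    comm_penalty = 1.0 / (1.0 + 0.1 * (plan.tensor_parallel + plan.pipeline_parallel))
    return plan.gpus * comm_penalty * (1.0 - cluster_load)


def build_grid(max_gpus: int) -> dict[int, list[ParallelismPlan]]:
    """Shard the joint space into a grid keyed by total GPU count,
    so the scheduler only searches cells that fit the free resources."""
    grid: dict[int, list[ParallelismPlan]] = {}
    degrees = [1, 2, 4, 8]
    for dp, tp, pp in product(degrees, repeat=3):
        plan = ParallelismPlan(dp, tp, pp)
        if plan.gpus <= max_gpus:
            grid.setdefault(plan.gpus, []).append(plan)
    return grid


def schedule(free_gpus: int, cluster_load: float,
             grid: dict[int, list[ParallelismPlan]]) -> ParallelismPlan | None:
    """Pick the highest-estimated plan among grid cells that fit."""
    candidates = [p for g, plans in grid.items() if g <= free_gpus for p in plans]
    if not candidates:
        return None
    return max(candidates, key=lambda p: estimate_throughput(p, cluster_load))


if __name__ == "__main__":
    grid = build_grid(max_gpus=64)
    best = schedule(free_gpus=32, cluster_load=0.4, grid=grid)
    if best is not None:
        print(best, "est. throughput:", estimate_throughput(best, 0.4))
```

The point of the grid keying here is that it prunes the joint search: a scheduling decision (how many GPUs to grant) directly indexes the subset of parallelism plans worth estimating, rather than re-evaluating the full cross-product on every scheduling event.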