[2602.22437] veScale-FSDP: Flexible and High-Performance FSDP at Scale
Summary
The paper introduces veScale-FSDP, a redesigned system for Fully Sharded Data Parallel (FSDP) training that improves flexibility and performance at scale, delivering 5-66% higher throughput and 16-30% lower memory usage than existing FSDP systems.
Why It Matters
As AI models grow in size and complexity, efficient distributed training becomes crucial. veScale-FSDP addresses the flexibility, communication, and memory limitations of current FSDP systems, improving performance and scalability for researchers and practitioners in machine learning and distributed computing.
Key Takeaways
- veScale-FSDP offers a flexible sharding format, RaggedShard, paired with a structure-aware planning algorithm that enhances model training efficiency.
- Achieves 5-66% higher throughput compared to existing FSDP systems.
- Reduces memory usage by 16-30%, enabling training on tens of thousands of GPUs.
- Supports block-wise quantization and non-element-wise optimizers for advanced model training.
- Addresses limitations of current FSDP implementations in structure-aware training.
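To see why block-wise quantized training conflicts with FSDP's fixed element- or row-wise sharding, consider that each quantization block stores low-precision values plus one shared scale: a shard boundary that cuts through a block separates values from their scale. The sketch below is illustrative only (it is not veScale-FSDP's implementation); the block size and function names are assumptions for the example.

```python
import numpy as np

def blockwise_quantize(weights: np.ndarray, block_size: int = 4):
    """Quantize a flat weight buffer in fixed-size blocks: each block
    keeps int8 values plus one float scale. If a shard boundary split
    a block, the values and their scale would land on different ranks,
    which is why block-structured methods need structure-aware sharding."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.squeeze(1)

def blockwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reverse the quantization: rescale each block and flatten."""
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

# Two blocks of four weights; reconstruction error is bounded by scale/2.
w = np.array([0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0], dtype=np.float32)
q, s = blockwise_quantize(w, block_size=4)
w_hat = blockwise_dequantize(q, s)
```

The point of the example is the data-placement constraint, not the arithmetic: an element-wise shard of `w` at index 6 would leave rank 0 holding quantized values whose scale lives on rank 1.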
arXiv:2602.22437 [cs.DC] (Submitted on 25 Feb 2026)
Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu
Abstract
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP ac...
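The abstract's RaggedShard idea can be intuited as sharding that respects block boundaries rather than splitting parameters at fixed element offsets. The greedy partitioner below is a minimal sketch under that assumption, not the paper's planning algorithm; all names are hypothetical.

```python
def ragged_shard(block_sizes, world_size):
    """Illustrative sketch (not veScale-FSDP's algorithm): greedily
    assign whole blocks to the least-loaded rank, so no block is ever
    split across ranks. Per-rank shards end up with unequal ("ragged")
    element counts, unlike fixed element-wise sharding."""
    shards = [[] for _ in range(world_size)]  # block indices per rank
    loads = [0] * world_size                  # element count per rank
    for block_id, size in enumerate(block_sizes):
        rank = loads.index(min(loads))        # pick the least-loaded rank
        shards[rank].append(block_id)
        loads[rank] += size
    return shards, loads

# Five blocks of uneven size placed across two ranks.
shards, loads = ragged_shard([4, 4, 8, 2, 6], world_size=2)
```

The contrast with standard FSDP is the unit of placement: fixed sharding cuts the flat parameter buffer at equal offsets regardless of structure, while a block-aware plan trades perfectly equal shard sizes for the guarantee that structure-dependent computation never crosses a rank boundary.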