[2602.22437] veScale-FSDP: Flexible and High-Performance FSDP at Scale
Summary
The paper introduces veScale-FSDP, a redesigned system for Fully Sharded Data Parallel (FSDP) training that improves flexibility and performance at scale, delivering 5-66% higher throughput and 16-30% lower memory usage than existing FSDP systems.
Why It Matters
As AI models grow in size and complexity, efficient distributed training becomes crucial. veScale-FSDP addresses the flexibility, communication, and memory limitations of current FSDP systems, improving performance and scalability for researchers and practitioners in machine learning and distributed computing.
Key Takeaways
- veScale-FSDP offers a flexible sharding format, RaggedShard, paired with a structure-aware planning algorithm that enhances model training efficiency.
- Achieves 5-66% higher throughput compared to existing FSDP systems.
- Reduces memory usage by 16-30%, enabling training on tens of thousands of GPUs.
- Supports block-wise quantization and non-element-wise optimizers for advanced model training.
- Addresses limitations of current FSDP implementations in structure-aware training.
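To see why block-wise quantized training conflicts with FSDP's fixed element- or row-wise sharding, consider that each quantization block stores low-precision values plus one shared scale: a shard boundary that cuts through a block separates values from their scale. The sketch below is illustrative only (it is not veScale-FSDP's implementation); the block size and function names are assumptions for the example.

```python
import numpy as np

def blockwise_quantize(weights: np.ndarray, block_size: int = 4):
    """Quantize a flat weight buffer in fixed-size blocks: each block
    keeps int8 values plus one float scale. If a shard boundary split
    a block, the values and their scale would land on different ranks,
    which is why block-structured methods need structure-aware sharding."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.squeeze(1)

def blockwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reverse the quantization: rescale each block and flatten."""
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

# Two blocks of four weights; reconstruction error is bounded by scale/2.
w = np.array([0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0], dtype=np.float32)
q, s = blockwise_quantize(w, block_size=4)
w_hat = blockwise_dequantize(q, s)
```

The point of the example is the data-placement constraint, not the arithmetic: an element-wise shard of `w` at index 6 would leave rank 0 holding quantized values whose scale lives on rank 1.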
arXiv:2602.22437 [cs.DC] (Submitted on 25 Feb 2026)
Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu
Abstract
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP ac...
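The abstract's RaggedShard idea can be intuited as sharding that respects block boundaries rather than splitting parameters at fixed element offsets. The greedy partitioner below is a minimal sketch under that assumption, not the paper's planning algorithm; all names are hypothetical.

```python
def ragged_shard(block_sizes, world_size):
    """Illustrative sketch (not veScale-FSDP's algorithm): greedily
    assign whole blocks to the least-loaded rank, so no block is ever
    split across ranks. Per-rank shards end up with unequal ("ragged")
    element counts, unlike fixed element-wise sharding."""
    shards = [[] for _ in range(world_size)]  # block indices per rank
    loads = [0] * world_size                  # element count per rank
    for block_id, size in enumerate(block_sizes):
        rank = loads.index(min(loads))        # pick the least-loaded rank
        shards[rank].append(block_id)
        loads[rank] += size
    return shards, loads

# Five blocks of uneven size placed across two ranks.
shards, loads = ragged_shard([4, 4, 8, 2, 6], world_size=2)
```

The contrast with standard FSDP is the unit of placement: fixed sharding cuts the flat parameter buffer at equal offsets regardless of structure, while a block-aware plan trades perfectly equal shard sizes for the guarantee that structure-dependent computation never crosses a rank boundary.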