[2602.22437] veScale-FSDP: Flexible and High-Performance FSDP at Scale

arXiv - Machine Learning · 3 min read

Summary

The paper introduces veScale-FSDP, a redesigned Fully Sharded Data Parallel (FSDP) system that pairs a flexible sharding format with structure-aware planning for large-scale model training, delivering 5-66% higher throughput and 16-30% lower memory usage than existing FSDP systems.

Why It Matters

As AI models grow in size and structural complexity, efficient training methods become crucial. veScale-FSDP addresses the rigid sharding formats and the communication and memory inefficiencies of current FSDP systems, enabling better performance and scalability for researchers and practitioners in machine learning and distributed computing.

Key Takeaways

  • veScale-FSDP offers a flexible sharding format that enhances model training efficiency.
  • Achieves 5-66% higher throughput compared to existing FSDP systems.
  • Reduces memory usage by 16-30%, enabling training on tens of thousands of GPUs.
  • Supports block-wise quantization and non-element-wise optimizers (e.g., Shampoo, Muon) for advanced model training; the sketch after this list illustrates the underlying sharding conflict.
  • Addresses limitations of current FSDP implementations in structure-aware training.
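
The following is a minimal, self-contained Python sketch (not code from the paper) of the conflict the paper describes: when FSDP shards a parameter into equal-size pieces along dim 0, the shard boundaries generally do not align with the block grid used by block-wise quantization, so individual quantization blocks end up split across ranks. The shapes, block size, and world size below are illustrative assumptions.

```python
# Illustrative sketch: count quantization blocks that straddle shard
# boundaries under fixed row-wise (dim-0) FSDP sharding. Not the paper's
# code; the shapes and sizes are arbitrary examples.

def count_split_blocks(rows: int, block: int, world_size: int) -> int:
    per_rank = -(-rows // world_size)          # ceil: rows owned by each rank
    owner = lambda r: min(r // per_rank, world_size - 1)
    split = 0
    for r0 in range(0, rows, block):           # walk the block grid along dim 0
        r1 = min(r0 + block, rows) - 1         # last row of this block
        if owner(r0) != owner(r1):             # block's rows live on >1 rank
            split += 1
    return split

if __name__ == "__main__":
    # e.g. a 4096-row weight, 128-row quantization blocks, 24 ranks
    total = -(-4096 // 128)
    n = count_split_blocks(rows=4096, block=128, world_size=24)
    print(f"{n}/{total} quantization blocks straddle a shard boundary")
```

With these illustrative numbers, 23 of the 32 block rows are cut by a rank boundary, so computing each block's quantization scale would require gathering data across ranks.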

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2602.22437 (cs) [Submitted on 25 Feb 2026]

Title: veScale-FSDP: Flexible and High-Performance FSDP at Scale

Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Abstract: Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP ac...
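
The abstract names RaggedShard but this excerpt does not describe its API. As a rough illustration of the idea of a flexible, block-aligned sharding format, here is a hedged sketch in which shard boundaries snap to quantization-block boundaries, producing unequal ("ragged") per-rank shard sizes. All names and the planning rule are assumptions for illustration, not veScale-FSDP's actual implementation.

```python
# Hypothetical sketch of a block-aligned, ragged sharding plan. The class
# and function names are invented for illustration; this is not the
# RaggedShard API from the paper.

from dataclasses import dataclass
from typing import List

@dataclass
class RaggedPlan:
    starts: List[int]   # first row owned by each rank
    stops: List[int]    # one past the last row owned by each rank

def plan_block_aligned(rows: int, block: int, world_size: int) -> RaggedPlan:
    n_blocks = -(-rows // block)                 # total block rows (ceil)
    base, rem = divmod(n_blocks, world_size)     # distribute whole blocks
    starts, stops, cursor = [], [], 0
    for rank in range(world_size):
        take = base + (1 if rank < rem else 0)   # some ranks take one extra block
        starts.append(min(cursor * block, rows))
        cursor += take
        stops.append(min(cursor * block, rows))
    return RaggedPlan(starts, stops)

if __name__ == "__main__":
    plan = plan_block_aligned(rows=4096, block=128, world_size=24)
    sizes = [b - a for a, b in zip(plan.starts, plan.stops)]
    print(sizes)   # unequal shard sizes, but every shard is block-aligned
```

Under this toy rule the shards are no longer equal-sized (here, eight ranks hold two blocks and the rest hold one), which is presumably where the paper's structure-aware planning algorithm comes in: balancing such ragged shards against communication and memory budgets.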

Related Articles

LLMs

This Is Not Hacking. This Is Structured Intelligence.

Watch me demonstrate everything I've been talking about—live, in real time. The Setup: Maestro University AI enrollment system Standard c...

Reddit - Artificial Intelligence · 1 min
LLMs

[D] Howcome Muon is only being used for Transformers?

Muon has quickly been adopted in LLM training, yet we don't see it being talked about in other contexts. Searches for Muon on ConvNets tu...

Reddit - Machine Learning · 1 min
LLMs

[P] I trained a language model from scratch for a low resource language and got it running fully on-device on Android (no GPU, demo)

Hi Everybody! I just wanted to share an update on a project I’ve been working on called BULaMU, a family of language models trained (20M,...

Reddit - Machine Learning · 1 min
LLMs

Popular AI gateway startup LiteLLM ditches controversial startup Delve | TechCrunch

LiteLLM had obtained two security compliance certifications via Delve and fell victim to some horrific credential-stealing malware last w...

TechCrunch - AI · 3 min