[2507.21372] Load Balancing for AI Training Workloads

Summary

This paper evaluates various load-balancing designs for AI training workloads, revealing that packet spraying outperforms traditional methods and introducing a new switch-based implementation called Ofan.

Why It Matters

As AI training demands increase, efficient load balancing becomes crucial for optimizing performance. This research provides insights into the effectiveness of different load-balancing strategies, guiding future implementations and improvements in AI infrastructure.

Key Takeaways

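  • Packet spraying outperforms traditional schemes that balance load at flow, flowlet, or subflow granularity.
  • Host-based and switch-based packet spraying perform similarly without failures, but the host-based approach dominates under link failure because it quickly sees end-to-end path conditions.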
  • No current approach achieves optimal O(1) queue scaling at maximum utilization.
  • The destination-based rotation (DR) discipline can reach optimal performance.
  • The new switch-based implementation, Ofan, shows significant performance gains.
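
To make the first takeaway concrete, here is a minimal toy sketch of why per-packet spraying balances load better than per-flow placement when there are only a few large flows. It is not the paper's evaluation method. The function names, the flow sizes, the path count, and the random path choice standing in for an ECMP-style hash are all assumptions made for illustration, and the model ignores packet reordering, congestion control, and link failures.

```python
import random


def flow_level_loads(flows, num_paths, seed=0):
    """Pin every packet of a flow to a single path.

    A seeded random choice stands in for an ECMP-style hash (assumption).
    """
    rng = random.Random(seed)
    loads = [0] * num_paths
    for packets in flows:
        loads[rng.randrange(num_paths)] += packets
    return loads


def packet_spray_loads(flows, num_paths):
    """Spread the packets of every flow over all paths in round-robin order."""
    loads = [0] * num_paths
    next_path = 0
    for packets in flows:
        for _ in range(packets):
            loads[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return loads


def imbalance(loads):
    """Busiest path's load divided by the mean load (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical workload: 4 large flows of 1000 packets each over 8 equal paths.
flows = [1000] * 4
num_paths = 8

print("flow-level imbalance:  ", imbalance(flow_level_loads(flows, num_paths)))
print("packet-spray imbalance:", imbalance(packet_spray_loads(flows, num_paths)))
```

With four flows and eight paths, per-flow placement can use at most four paths, so its busiest path carries at least twice the mean load however the flows land. Round-robin spraying puts 500 packets on every path in this toy setting.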

Computer Science > Networking and Internet Architecture

arXiv:2507.21372 (cs)

[Submitted on 28 Jul 2025 (v1), last revised 13 Feb 2026 (this version, v2)]

Title: Load Balancing for AI Training Workloads

Authors: Sarah McClure, Evyatar Cohen, Alex Shpiner, Mark Silberstein, Sylvia Ratnasamy, Scott Shenker, Isaac Keslassy

Abstract: The extreme bandwidth demands of AI training have made load-balancing a critical component in AI fabrics, and a variety of load-balancing designs have emerged in recent work from both industry and research. However, there is currently little consensus on which design approach dominates or the conditions under which an approach dominates. We also lack an understanding of how far these approaches are from optimal. We provide a technical foundation for answering these questions by systematically evaluating leading load-balancing designs, while decoupling them from specific congestion control and loss recovery stacks. We find that load-balancing based on packet spraying dominates traditional approaches that load balance traffic at flow, flowlet, or subflow granularities. When comparing host- vs switch-based approaches to packet spraying, we find that they perform similarly in failure-free scenarios but that a host-based approach dominates under link failure because of its rapid visibility into end-to-end path conditions. We also identify ...
