[2603.04424] When Scaling Fails: Network and Fabric Effects on Distributed GPU Training Performance
Computer Science > Networking and Internet Architecture

arXiv:2603.04424 (cs) [Submitted on 16 Feb 2026]

Title: When Scaling Fails: Network and Fabric Effects on Distributed GPU Training Performance

Authors: Dinesh Gopalan, Ratul Ali

Abstract: Scaling distributed GPU training is commonly assumed to yield predictable performance gains as additional nodes are added. In practice, many large-scale deployments encounter diminishing returns and unstable behavior well before theoretical limits are reached. This paper examines why scaling fails in real systems, with a focus on the role of network and fabric effects that are often overlooked by higher-level training frameworks. We present an empirical study of distributed GPU training performance across multiple production-scale clusters. Our results show that network topology, congestion dynamics, collective synchronization behavior, and GPU locality frequently dominate end-to-end training performance once workloads move beyond a small number of nodes. Identical models and software stacks can exhibit sharply different scaling characteristics depending on fabric design and runtime communication patterns. We identify recurring failure modes that emerge as training transitions from single-node to multi-node execution, including synchronization amplification...
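The diminishing-returns pattern the abstract describes can be illustrated with a toy cost model of a ring all-reduce, the collective commonly used to synchronize gradients: per step, each node pays roughly 2(N-1) latency hops plus 2(N-1)/N of the gradient payload over its link, so the communication term grows with node count while per-node compute stays fixed. This is a minimal sketch, not the paper's methodology; every parameter below (gradient size, bandwidth, latency, step time) is a hypothetical placeholder, not a measurement from the study.

```python
def ring_allreduce_time(n_nodes, msg_bytes, bw_bytes_per_s, latency_s):
    """Classic ring all-reduce cost model: 2(N-1) latency hops plus
    2(N-1)/N of the gradient payload crossing each link per step."""
    if n_nodes == 1:
        return 0.0  # no cross-node synchronization needed
    return (2 * (n_nodes - 1) * latency_s
            + 2 * (n_nodes - 1) / n_nodes * msg_bytes / bw_bytes_per_s)

def scaling_efficiency(n_nodes, t_compute_s, msg_bytes, bw_bytes_per_s, latency_s):
    """Fraction of ideal linear speedup retained, assuming no
    compute/communication overlap (a pessimistic simplification)."""
    t_comm = ring_allreduce_time(n_nodes, msg_bytes, bw_bytes_per_s, latency_s)
    return t_compute_s / (t_compute_s + t_comm)

if __name__ == "__main__":
    GRAD_BYTES = 2 * 1024**3   # 2 GiB of gradients per step (hypothetical)
    BW = 25 * 1024**3          # 25 GiB/s effective link bandwidth (assumed)
    LATENCY = 20e-6            # 20 us per hop (assumed)
    T_STEP = 0.25              # 250 ms of compute per step (assumed)
    for n in (1, 4, 16, 64, 256):
        eff = scaling_efficiency(n, T_STEP, GRAD_BYTES, BW, LATENCY)
        print(f"{n:4d} nodes: scaling efficiency {eff:.2f}")
```

Under these assumed numbers the communication term quickly approaches a constant fraction of the payload transfer time, so efficiency drops sharply after the first few nodes and then erodes slowly, matching the "diminishing returns well before theoretical limits" behavior the abstract reports.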