[2603.13606] NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2603.13606 (cs)

[Submitted on 13 Mar 2026 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

Authors: Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, Peter-Jan Gootzen, Aamir Shafi, Assaf Ravid, Salvatore Di Girolamo, Manjunath Gorentla Venkata, Gil Bloch (NVIDIA Corporation)

Abstract: Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double...
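To make the dispatch/combine terminology concrete, below is a minimal single-process NumPy sketch of the semantics these primitives implement: top-k gating routes copies of each token to its selected experts (the data movement ncclEpDispatch performs across GPUs), and the expert outputs are reduced back at the source with the gating weights (the return-path reduction of ncclEpCombine). The array-based routing, sizes, and the toy expert function here are purely illustrative assumptions, not the library's API.

```python
import numpy as np

# Illustrative sizes (assumptions): 8 tokens, hidden dim 4, 4 experts, top-2 routing.
num_tokens, hidden, num_experts, top_k = 8, 4, 4, 2
rng = np.random.default_rng(0)

tokens = rng.standard_normal((num_tokens, hidden)).astype(np.float32)
gate_logits = rng.standard_normal((num_tokens, num_experts)).astype(np.float32)

# Top-k gating: each token picks its k highest-scoring experts.
topk_ids = np.argsort(gate_logits, axis=1)[:, -top_k:]           # (tokens, k)
topk_scores = np.take_along_axis(gate_logits, topk_ids, axis=1)
topk_weights = np.exp(topk_scores) / np.exp(topk_scores).sum(axis=1, keepdims=True)

# "Dispatch": group token copies by destination expert. In a real EP setup this
# is the all-to-all exchange that a dispatch primitive performs across ranks.
expert_inputs = {e: [] for e in range(num_experts)}
for t in range(num_tokens):
    for j in range(top_k):
        expert_inputs[topk_ids[t, j]].append((t, j))

# Each expert applies its function (a per-expert bias here, just as a stand-in).
expert_bias = rng.standard_normal((num_experts, hidden)).astype(np.float32)
partial_out = np.zeros((num_tokens, top_k, hidden), dtype=np.float32)
for e, items in expert_inputs.items():
    for t, j in items:
        partial_out[t, j] = tokens[t] + expert_bias[e]

# "Combine": weighted sum of each token's expert outputs back at its source,
# the reduction a combine primitive performs on the return path.
combined = (topk_weights[..., None] * partial_out).sum(axis=1)
print(combined.shape)  # (8, 4)
```

In an actual expert-parallel deployment the dispatch and combine steps are cross-GPU communication (RDMA and NVLink per the abstract) rather than in-memory grouping, which is why their latency and throughput characteristics motivate the LL and HT modes described above.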