[2503.04398] Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Summary
This paper introduces Semantic Parallelism, a new paradigm for efficient MoE inference that co-schedules expert placement and token routing to minimize communication costs and improve throughput in large language model serving.
Why It Matters
As large language models (LLMs) become increasingly prevalent, optimizing their inference processes is crucial for performance and resource management. This research addresses the inefficiencies in expert parallelism, a common approach in LLM serving, by proposing a method that reduces communication overhead, potentially leading to faster and more efficient model deployments.
Key Takeaways
- Semantic Parallelism minimizes communication costs in MoE inference.
- The Sem-MoE framework collocates experts and their activating tokens on the same device.
- Three key scheduling techniques are introduced to improve inference throughput.
- The proposed method significantly reduces all-to-all communication volume.
- Experiments demonstrate superior performance compared to existing solutions.
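To make the collocation idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation): if an expert is placed on the device that already holds most of its activating tokens, fewer tokens must cross devices, shrinking the all-to-all volume. All names and the toy routing table are illustrative assumptions.

```python
from collections import Counter

def all_to_all_volume(token_device, token_expert, expert_device):
    """Count tokens that must cross devices to reach their assigned expert."""
    return sum(
        1
        for tok, dev in token_device.items()
        if expert_device[token_expert[tok]] != dev
    )

# Toy setup: 6 tokens spread over 2 devices, routed to 2 experts.
token_device = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
token_expert = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "A"}

# Routing-oblivious placement: expert A on device 1, B on device 0.
naive = {"A": 1, "B": 0}

# Affinity-aware placement: put each expert on the device holding
# the majority of its activating tokens.
votes = {}
for tok, exp in token_expert.items():
    votes.setdefault(exp, Counter())[token_device[tok]] += 1
aware = {exp: c.most_common(1)[0][0] for exp, c in votes.items()}

print(all_to_all_volume(token_device, token_expert, naive))  # 5 crossings
print(all_to_all_volume(token_device, token_expert, aware))  # 1 crossing
```

In this toy example, affinity-aware placement cuts cross-device transfers from 5 tokens to 1; the paper's contribution is doing this jointly with token scheduling at serving scale.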
Computer Science > Machine Learning · arXiv:2503.04398 (cs)
Submitted on 6 Mar 2025 (v1); last revised 24 Feb 2026 (this version, v4)
Authors: Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng
Abstract: Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, as EP relies on expensive all-to-all collectives to route tokens to remote experts that are not collocated on the same GPU/NPU device. Moreover, state-of-the-art schemes treat expert device-placement and request (or token) device-scheduling as separate concerns, triggering excessive communication and compromising inference efficiency. This paper proposes Semantic Parallelism, a novel parallelism paradigm that minimizes the steep communication costs in EP-centric MoE serving via model-data collaborative scheduling. We implement Semantic Parallelism in a framework called Sem-MoE. Sem-MoE maximally collocates experts and their activating tokens onto the same device using proactively modeled activation likelihood between them, and introduces three key techniques: (1) O...
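The abstract's "proactively modeled activation likelihood" can be sketched as follows. This is a speculative illustration, not Sem-MoE's actual algorithm: given a hypothetical likelihood matrix P[t, e] (the modeled probability that token t activates expert e) and a fixed expert placement, each token is scheduled to the device whose resident experts capture the largest share of its likelihood mass, so most activations are served locally without all-to-all traffic.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TOKENS, NUM_EXPERTS, NUM_DEVICES = 8, 4, 2

# Hypothetical activation-likelihood model: row t is a probability
# distribution over the experts token t may activate.
P = rng.dirichlet(np.ones(NUM_EXPERTS), size=NUM_TOKENS)

# Assume a simple even split of experts across devices.
expert_device = np.arange(NUM_EXPERTS) % NUM_DEVICES

# Likelihood mass each device's resident experts capture per token.
local_mass = np.zeros((NUM_TOKENS, NUM_DEVICES))
for e, d in enumerate(expert_device):
    local_mass[:, d] += P[:, e]

# Schedule each token to the device maximizing its locally served mass.
token_device = local_mass.argmax(axis=1)

# Expected fraction of activations served without cross-device routing.
expected_local = local_mass[np.arange(NUM_TOKENS), token_device].mean()
print(f"expected locally served activation mass: {expected_local:.2f}")
```

Because each token's mass sums to 1 across devices, picking the argmax over two devices guarantees at least half of the expected activations are local; the full system would additionally optimize the expert placement itself and respect per-device capacity limits.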