[2602.15091] Mixture-of-Experts under Finite-Rate Gating: Communication--Generalization Trade-offs
Summary
This paper explores the trade-offs in Mixture-of-Experts (MoE) architectures under finite-rate gating, focusing on communication efficiency and generalization capabilities through an information-theoretic lens.
Why It Matters
Understanding the communication-generalization trade-offs in MoE systems is crucial for optimizing machine learning models, especially in scenarios with limited communication bandwidth. This research provides insights that can enhance model performance and efficiency in real-world applications.
Key Takeaways
- MoE architectures decompose prediction tasks across specialized expert sub-networks selected by a gating mechanism.
- Finite-rate gating introduces a communication-theoretic perspective, impacting model expressivity and generalization.
- The study specializes a mutual-information generalization bound and develops a rate-distortion characterization $D(R_g)$ of finite-rate gating.
- Numerical simulations validate the theoretical findings on gating rate and generalization trade-offs.
- Capacity-aware limits are established for communication-constrained MoE systems.
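The finite-rate gating idea from the takeaways can be illustrated with a toy model: a hard (top-1) gate over $K$ experts emits a discrete index, so its output carries at most $\log_2 K$ bits per input. A minimal sketch, assuming nothing about the paper's actual architecture (all names, sizes, and the linear-expert choice here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

class FiniteRateMoE:
    """Toy MoE: K linear experts selected by a hard (top-1) gate.

    With a hard gate, the gate output T is a discrete index in
    {0, ..., K-1}, so the gating rate is bounded:
    I(X; T) <= H(T) <= log2(K) bits per input.
    """

    def __init__(self, d_in, d_out, K):
        self.W_gate = rng.normal(size=(d_in, K)) * 0.1       # gate scores
        self.experts = rng.normal(size=(K, d_in, d_out)) * 0.1
        self.K = K

    def forward(self, x):
        logits = x @ self.W_gate              # (n, K) gate logits
        idx = logits.argmax(axis=-1)          # hard gate: a log2(K)-bit message
        # Apply each input's selected expert.
        y = np.einsum('ni,nio->no', x, self.experts[idx])
        return y, idx

moe = FiniteRateMoE(d_in=8, d_out=2, K=4)
x = rng.normal(size=(16, 8))
y, idx = moe.forward(x)
print(y.shape, idx.shape)
```

Replacing the hard gate with a stochastic or quantized soft gate changes the achievable rate, which is the knob the paper's trade-off analysis studies.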
Paper Details
Statistics > Machine Learning — arXiv:2602.15091 [stat.ML], submitted on 16 Feb 2026.
Title: Mixture-of-Experts under Finite-Rate Gating: Communication--Generalization Trade-offs
Authors: Ali Khalesi, Mohammad Reza Deylam Salehi
Abstract: Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Cite as: arXiv:2602.15091 [stat.ML] (or arXiv:2602.15091v1 [stat.ML] for this version)
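Because the gate output $T$ of a hard $K$-way gate is discrete, the gating rate $R_g = I(X; T)$ in the abstract's bound is itself bounded by the gate's entropy, $I(X; T) \le H(T) \le \log_2 K$. A small empirical check of that chain, using simulated gate indices rather than any data from the paper:

```python
import numpy as np
from collections import Counter

def gate_entropy_bits(assignments):
    """Empirical entropy H(T), in bits, of discrete gate assignments."""
    counts = np.array(list(Counter(assignments).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

K = 4
rng = np.random.default_rng(1)
t = rng.integers(0, K, size=10_000)   # simulated gate indices (hypothetical)
H = gate_entropy_bits(t.tolist())
# I(X; T) <= H(T) <= log2(K): the empirical entropy never exceeds the cap.
print(H, np.log2(K))
```

Under this reading, lowering the gating rate (fewer effective experts, or a coarser gate codebook) tightens the $D(R_g)$ term while reducing communication, which is the trade-off the simulations probe.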