[2510.04694] Multilingual Routing in Mixture-of-Experts
Summary
This paper analyzes multilingual routing in Mixture-of-Experts (MoE) architectures, revealing how these models route multilingual data layer by layer and showing that targeted routing interventions can improve multilingual performance.
Why It Matters
Understanding how MoE models process multilingual data is crucial for enhancing their performance in diverse linguistic contexts. This research provides insights into routing dynamics that can lead to better multilingual AI applications, making it relevant for developers and researchers in AI and NLP.
Key Takeaways
- MoE models exhibit language-specific routing patterns in early and late layers, with cross-lingual alignment in middle layers.
- Performance in a given language correlates strongly with how similarly its tokens are routed relative to English, suggesting that leveraging shared, language-universal experts drives multilingual performance.
- Targeted interventions in middle layers can enhance multilingual performance by promoting task experts activated in English.
- Simple routing interventions yield consistent performance gains across multiple languages and models.
- Generalization in MoEs is constrained by their ability to utilize language-universal experts effectively.
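The routing-similarity measurement behind the second takeaway can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes we have recorded, for one middle layer, which expert each token was routed to on parallel English and Spanish text, turns those traces into expert-usage distributions, and compares them with a Jensen-Shannon-based similarity (the paper's exact metric may differ).

```python
# Sketch: quantifying cross-lingual routing alignment in one MoE layer.
# Hypothetical data: per-token expert choices recorded on parallel text.
from collections import Counter
import math

def routing_distribution(expert_choices, num_experts):
    """Normalize a trace of selected expert indices into a probability vector."""
    counts = Counter(expert_choices)
    total = len(expert_choices)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def js_similarity(p, q):
    """1 - Jensen-Shannon divergence (base 2); 1.0 means identical routing."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

# Toy routing traces for one middle layer (expert index chosen per token).
english_routes = [0, 1, 1, 2, 5, 1, 0, 2]
spanish_routes = [0, 1, 2, 2, 5, 1, 1, 2]

p = routing_distribution(english_routes, num_experts=8)
q = routing_distribution(spanish_routes, num_experts=8)
print(f"middle-layer routing similarity: {js_similarity(p, q):.3f}")
```

Computing this per layer across languages is what exposes the U-shaped pattern the paper reports: low similarity in early and late layers, high similarity in the middle.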
Computer Science > Computation and Language
arXiv:2510.04694 (cs)
[Submitted on 6 Oct 2025 (v1), last revised 17 Feb 2026 (this version, v2)]
Title: Multilingual Routing in Mixture-of-Experts
Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two eval...
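The steering intervention described in the abstract can be sketched as a bias on the router logits before top-k expert selection. Everything below is an illustrative assumption rather than the authors' implementation: the expert indices, logit values, and bias magnitude are made up, and the real method identifies "English task experts" empirically from activation statistics.

```python
# Sketch of the routing-intervention idea: in a middle layer, boost the
# router logits of experts frequently activated for the task in English,
# so standard top-k routing is nudged toward those language-universal experts.

def steer_router(logits, promoted_experts, bias=1.0):
    """Return router logits with a fixed bias added to promoted expert indices."""
    steered = list(logits)
    for e in promoted_experts:
        steered[e] += bias
    return steered

def top_k_experts(logits, k=2):
    """Indices of the k highest-scoring experts (standard MoE top-k routing)."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]

# Toy router logits for 8 experts and a hypothetical English task-expert set.
logits = [0.2, -0.5, 1.1, 0.9, -1.3, 0.4, 0.7, -0.1]
english_task_experts = [3, 6]

print("before steering:", top_k_experts(logits))                              # [2, 3]
print("after steering: ", top_k_experts(steer_router(logits, english_task_experts)))  # [3, 6]
```

The design point matches the paper's framing: the intervention does not retrain the router, it only shifts inference-time logits in middle layers, which is why a simple additive bias suffices to raise cross-lingual routing alignment.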