[2604.02178] The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Computer Science > Computation and Language
arXiv:2604.02178 (cs)
[Submitted on 2 Apr 2026]

Title: The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Authors: Jeremy Herbst, Jae Hee Lee, Stefan Wermter

Abstract: Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, speci...
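The abstract's central measurement is $k$-sparse probing: restrict a linear probe for a concept to its $k$ most informative neurons, and compare accuracy at small $k$. If a single neuron suffices, the concept is encoded monosemantically; if accuracy only recovers at larger $k$, the signal is spread across polysemantic neurons. The following is a minimal, self-contained sketch of that idea on synthetic activations, not the paper's actual implementation: `make_data`, `k_sparse_probe_acc`, and the correlation-based neuron selection are illustrative stand-ins (real probing would use recorded model activations and a trained sparse linear classifier).

```python
import random

random.seed(0)

def make_data(n=400, d=16, mono=True):
    """Synthetic 'activations': a binary concept is carried by 1 neuron
    (monosemantic) or spread thinly over 4 neurons (polysemantic)."""
    carriers = [0] if mono else [0, 1, 2, 3]
    X, y = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        row = [random.gauss(0.0, 1.0) for _ in range(d)]
        for c in carriers:
            # total signal magnitude is fixed; each carrier gets a share
            row[c] += (2.0 if label else -2.0) / len(carriers)
        X.append(row)
        y.append(label)
    return X, y

def k_sparse_probe_acc(X, y, k):
    """k-sparse probe: pick the k neurons whose class means differ most,
    then classify by the sign of their (sign-aligned) summed activation."""
    n, d = len(X), len(X[0])

    def class_mean(j, c):
        vals = [X[i][j] for i in range(n) if y[i] == c]
        return sum(vals) / len(vals)

    gaps = {j: class_mean(j, 1) - class_mean(j, 0) for j in range(d)}
    top = sorted(range(d), key=lambda j: abs(gaps[j]), reverse=True)[:k]
    signs = {j: (1.0 if gaps[j] > 0 else -1.0) for j in top}

    correct = 0
    for i in range(n):
        score = sum(signs[j] * X[i][j] for j in top)
        correct += int((score > 0) == (y[i] == 1))
    return correct / n
```

With this toy setup, a $k=1$ probe nearly saturates on the monosemantic data but degrades on the polysemantic data, where accuracy only recovers once $k$ covers all carrier neurons; that accuracy-vs-$k$ gap is the kind of signal the paper uses to compare expert neurons against dense FFN neurons.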