[2601.11616] Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Summary
This paper studies Mixture-of-Experts (MoE) architectures through a geometric lens, using a Dual Jacobian-PCA spectral probe to analyze local function sensitivity and the geometry of routed representations.
Why It Matters
Understanding the geometric implications of MoE architectures is crucial for improving their efficiency and performance in machine learning. This study provides insights that could lead to better model designs and applications, particularly in natural language processing and other complex tasks.
Key Takeaways
- MoE routing reduces local sensitivity: expert-local Jacobians show smaller leading singular values than dense baselines.
- Weighted PCA shows that expert-local representations spread variance across more principal directions, i.e., they have higher effective rank.
- Top-k routing leads to lower-rank structures, while fully soft routing results in broader representations.
- The findings suggest MoEs can be interpreted as soft partitions of function space.
- The study provides testable predictions for expert scaling and ensemble diversity.
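The two halves of the probe described above reduce to standard linear algebra: singular values of a local Jacobian measure sensitivity, and the spectral entropy of a routing-weighted covariance measures effective rank. A minimal numpy sketch follows; the function names and the entropy-based effective-rank estimator are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def jacobian_spectrum(J):
    """Singular values of a local Jacobian J (out_dim, in_dim).

    Smaller leading values and faster decay indicate lower local
    sensitivity, as reported for expert-local Jacobians.
    """
    return np.linalg.svd(J, compute_uv=False)

def weighted_effective_rank(H, w):
    """Effective rank of hidden states H (n, d) under routing weights w (n,).

    Builds the routing-weighted covariance, then returns the exponential
    of the entropy of its normalized eigenvalue spectrum (an assumed,
    common effective-rank estimator).
    """
    w = w / w.sum()                       # normalize routing weights
    mu = w @ H                            # weighted mean
    Hc = H - mu                           # center
    C = (Hc * w[:, None]).T @ Hc          # weighted covariance (d, d)
    ev = np.clip(np.linalg.eigvalsh(C), 0.0, None)
    p = ev / ev.sum()                     # eigenvalue distribution
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))
```

For example, three one-hot hidden states with equal routing weight span a 2-dimensional centered subspace with an isotropic spectrum, so the estimator returns an effective rank of 2.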
Computer Science > Machine Learning

arXiv:2601.11616 (cs)
[Submitted on 9 Jan 2026 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Authors: Feilong Liu

Abstract: Mixture-of-Experts (MoE) architectures are widely used for efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly understood. We study MoEs through a geometric lens, interpreting routing as soft partitioning into overlapping expert-local charts. We introduce a Dual Jacobian-PCA spectral probe that analyzes local function geometry via Jacobian singular value spectra and representation geometry via weighted PCA of routed hidden states. Using a controlled MLP-MoE setting with exact Jacobian computation, we compare dense, Top-k, and fully soft routing under matched capacity. Across random seeds, MoE routing consistently reduces local sensitivity: expert-local Jacobians show smaller leading singular values and faster spectral decay than dense baselines. Weighted PCA reveals that expert-local representations distribute variance across more principal directions, indicating higher effective rank. We further observe low alignment among expert Jacobians, suggesting decomposition into low-overlap expert-specific tra...
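The routing schemes compared in the abstract differ only in how gate weights over experts are formed: fully soft routing keeps a full softmax, while Top-k routing keeps the k largest gates and renormalizes them. A hedged numpy sketch under those definitions (function names are illustrative assumptions):

```python
import numpy as np

def soft_gates(logits):
    """Fully soft routing: softmax over all expert logits."""
    z = logits - logits.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def topk_gates(logits, k):
    """Top-k routing: keep the k largest soft gates, zero the rest,
    and renormalize so the kept gates sum to 1."""
    g = soft_gates(logits)
    idx = np.argsort(g)[-k:]    # indices of the k largest gates
    out = np.zeros_like(g)
    out[idx] = g[idx]
    return out / out.sum()
```

With four experts and k=2, only the two highest-scoring experts receive nonzero weight, which is the mechanism the paper links to lower-rank structure relative to fully soft routing.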