[2604.08565] Dynamic sparsity in tree-structured feed-forward layers at scale
Computer Science > Computation and Language
arXiv:2604.08565 (cs)
[Submitted on 18 Mar 2026]

Title: Dynamic sparsity in tree-structured feed-forward layers at scale
Authors: Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel

Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding a partial conversion of dynamic routing into static structural sparsity. We show that s...
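To make the idea of tree-structured conditional computation with hard hierarchical routing concrete, the sketch below shows one plausible form such a layer could take: a binary tree whose internal nodes hold learned hyperplanes, where each token descends by the sign of the node score and only the small MLP at the selected leaf is evaluated. This is a minimal illustration under assumed design choices (the class name `TreeFFN`, per-node hyperplanes, per-leaf two-layer MLPs, and the tree depth are all hypothetical), not the authors' implementation; the paper's exact parametrization and training procedure may differ.

```python
# Minimal sketch (not the paper's implementation) of a tree-structured
# feed-forward layer with hard hierarchical routing. Each token descends a
# binary tree of depth D using learned per-node hyperplanes (no separate
# router network) and only the MLP at the reached leaf is evaluated, so
# roughly 1/2^D of the layer's hidden units are active per token
# (e.g. ~3% at depth 5). Hard routing is non-differentiable; training would
# need a surrogate (e.g. straight-through), which this sketch omits.
import torch
import torch.nn as nn


class TreeFFN(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int, leaf_hidden: int, depth: int = 5):
        super().__init__()
        self.depth = depth
        num_nodes = 2 ** depth - 1          # internal routing nodes
        num_leaves = 2 ** depth             # leaf expert MLPs
        self.node_planes = nn.Linear(d_model, num_nodes)
        self.leaf_in = nn.Parameter(torch.randn(num_leaves, d_model, leaf_hidden) * d_model ** -0.5)
        self.leaf_out = nn.Parameter(torch.randn(num_leaves, leaf_hidden, d_model) * leaf_hidden ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); all node scores computed in one matmul for clarity
        scores = self.node_planes(x)                         # (batch, num_nodes)
        node = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            # hard routing: sign of the current node's score picks the child
            go_right = (scores.gather(1, node.unsqueeze(1)).squeeze(1) > 0).long()
            node = 2 * node + 1 + go_right                   # descend to left/right child
        leaf = node - (2 ** self.depth - 1)                  # index among the leaves
        w_in = self.leaf_in[leaf]                            # (batch, d_model, leaf_hidden)
        w_out = self.leaf_out[leaf]                          # (batch, leaf_hidden, d_model)
        h = torch.relu(torch.bmm(x.unsqueeze(1), w_in))      # asymmetric nonlinearity (ReLU)
        return torch.bmm(h, w_out).squeeze(1)                # (batch, d_model)
```

Under this sketch, the auto-pruning effect described in the abstract would correspond to some leaves never being selected during training: their parameters stop receiving gradient and can be dropped, turning part of the dynamic routing into static structural sparsity.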