[2604.08565] Dynamic sparsity in tree-structured feed-forward layers at scale
Computer Science > Computation and Language
arXiv:2604.08565 (cs)
[Submitted on 18 Mar 2026]

Title: Dynamic sparsity in tree-structured feed-forward layers at scale
Authors: Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel

Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding a partial conversion of dynamic routing into static structural sparsity. We show that s...
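To make the idea of tree-structured conditional computation with hard hierarchical routing concrete, the sketch below shows one plausible form such a layer could take: a binary tree whose internal nodes hold learned hyperplanes, where each token descends by the sign of the node score and only the small MLP at the selected leaf is evaluated. This is a minimal illustration under assumed design choices (the class name `TreeFFN`, per-node hyperplanes, per-leaf two-layer MLPs, and the tree depth are all hypothetical), not the authors' implementation; the paper's exact parametrization and training procedure may differ.

```python
# Minimal sketch (not the paper's implementation) of a tree-structured
# feed-forward layer with hard hierarchical routing. Each token descends a
# binary tree of depth D using learned per-node hyperplanes (no separate
# router network) and only the MLP at the reached leaf is evaluated, so
# roughly 1/2^D of the layer's hidden units are active per token
# (e.g. ~3% at depth 5). Hard routing is non-differentiable; training would
# need a surrogate (e.g. straight-through), which this sketch omits.
import torch
import torch.nn as nn


class TreeFFN(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int, leaf_hidden: int, depth: int = 5):
        super().__init__()
        self.depth = depth
        num_nodes = 2 ** depth - 1          # internal routing nodes
        num_leaves = 2 ** depth             # leaf expert MLPs
        self.node_planes = nn.Linear(d_model, num_nodes)
        self.leaf_in = nn.Parameter(torch.randn(num_leaves, d_model, leaf_hidden) * d_model ** -0.5)
        self.leaf_out = nn.Parameter(torch.randn(num_leaves, leaf_hidden, d_model) * leaf_hidden ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); all node scores computed in one matmul for clarity
        scores = self.node_planes(x)                         # (batch, num_nodes)
        node = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            # hard routing: sign of the current node's score picks the child
            go_right = (scores.gather(1, node.unsqueeze(1)).squeeze(1) > 0).long()
            node = 2 * node + 1 + go_right                   # descend to left/right child
        leaf = node - (2 ** self.depth - 1)                  # index among the leaves
        w_in = self.leaf_in[leaf]                            # (batch, d_model, leaf_hidden)
        w_out = self.leaf_out[leaf]                          # (batch, leaf_hidden, d_model)
        h = torch.relu(torch.bmm(x.unsqueeze(1), w_in))      # asymmetric nonlinearity (ReLU)
        return torch.bmm(h, w_out).squeeze(1)                # (batch, d_model)
```

Under this sketch, the auto-pruning effect described in the abstract would correspond to some leaves never being selected during training: their parameters stop receiving gradient and can be dropped, turning part of the dynamic routing into static structural sparsity.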