[2605.05222] Adaptive Computation Depth via Learned Token Routing in Transformers


Computer Science > Machine Learning — arXiv:2605.05222 (cs) [Submitted on 18 Apr 2026]

Title: Adaptive Computation Depth via Learned Token Routing in Transformers

Authors: Ahmed Abdelmuniem Abdalla Mohammed

Abstract: Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda=0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saves 14-23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at less than 0.5% quality loss. At matched efficiency, TSA achieves 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
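The abstract describes the gate as a two-layer MLP that maps each token's hidden state to a continuous halting probability, which then scales that token's residual update. The sketch below is a minimal NumPy illustration of that mechanism, not the paper's implementation: the gate placement, hidden width, activation choices, and the helper names (`tsa_gated_residual`, `block_fn`, the weight matrices) are all assumptions made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tsa_gated_residual(x, block_fn, W1, b1, W2, b2):
    """Per-token gated residual update, as sketched from the abstract.

    x        : (seq_len, d_model) token hidden states.
    block_fn : the transformer block (attention + FFN) as a function.
    W1..b2   : a two-layer MLP mapping each token state to a scalar
               gate g in (0, 1); the residual update is scaled by g,
               so g near 0 effectively skips the block for that token.
    """
    h = np.tanh(x @ W1 + b1)      # (seq_len, d_hidden) gate hidden layer
    g = sigmoid(h @ W2 + b2)      # (seq_len, 1) halting probability per token
    return x + g * block_fn(x), g

# Toy demo with an identity "block" and small random gate weights.
rng = np.random.default_rng(0)
d_model, d_hidden, seq_len = 8, 4, 5
x  = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, 1)) * 0.1
b2 = np.zeros(1)

y, g = tsa_gated_residual(x, lambda t: t, W1, b1, W2, b2)

# The abstract's depth regulariser (lambda > 0) would add a penalty on the
# mean gate value to the task loss, e.g.:  loss += lam * g.mean()
# (at lambda = 0 the paper reports the router still learns to skip tokens).
```

Because the gate is a smooth function of the token state, gradients flow through `g` end to end; at inference, tokens with near-zero gates can be dropped from the block's computation to realise the sparse-execution speedup the abstract mentions.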

Originally published on May 08, 2026. Curated by AI News.
