[2605.05222] Adaptive Computation Depth via Learned Token Routing in Transformers
Computer Science > Machine Learning

arXiv:2605.05222 (cs) [Submitted on 18 Apr 2026]

Title: Adaptive Computation Depth via Learned Token Routing in Transformers
Authors: Ahmed Abdelmuniem Abdalla Mohammed

Abstract: Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda=0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saves 14-23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieves 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
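The abstract describes a per-token gate: a small two-layer MLP reads each token's hidden state, emits a halting probability via a sigmoid, and that probability scales the residual update contributed by the next transformer block. A minimal NumPy sketch of this idea follows; the function names, the gate width, and the convention that the update is scaled by (1 - p_halt) are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def token_gate(h, W1, b1, W2, b2):
    """Two-layer MLP gate (hypothetical shapes): maps each token's
    hidden state to a halting probability in (0, 1).
    h: (seq_len, d_model), W1: (d_model, d_gate), W2: (d_gate, 1)."""
    z = np.maximum(h @ W1 + b1, 0.0)       # ReLU hidden layer
    logits = z @ W2 + b2                   # (seq_len, 1) per-token logit
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> halting probability

def gated_block(h, block_fn, gate_params):
    """Apply one transformer block with a per-token gate on its
    residual update. Tokens whose halting probability is near 1
    receive almost no update, i.e. they effectively skip the block."""
    p_halt = token_gate(h, *gate_params)           # (seq_len, 1)
    update = block_fn(h) - h                       # residual update of the block
    return h + (1.0 - p_halt) * update             # continuous, differentiable gate

# Illustrative usage with random parameters and a dummy block function.
rng = np.random.default_rng(0)
d_model, d_gate, seq_len = 8, 4, 5
params = (rng.standard_normal((d_model, d_gate)), np.zeros(d_gate),
          rng.standard_normal((d_gate, 1)), np.zeros(1))
h = rng.standard_normal((seq_len, d_model))
h_next = gated_block(h, lambda x: x + 1.0, params)
```

Because the gate is a smooth sigmoid rather than a hard binary decision, gradients of the task loss flow through it end-to-end, which is consistent with the abstract's claim that routing emerges even without an explicit depth penalty; at inference time, thresholding the probability would yield the sparse execution the abstract mentions.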