[2605.05222] Adaptive Computation Depth via Learned Token Routing in Transformers
Computer Science > Machine Learning

arXiv:2605.05222 (cs) [Submitted on 18 Apr 2026]

Title: Adaptive Computation Depth via Learned Token Routing in Transformers
Authors: Ahmed Abdelmuniem Abdalla Mohammed

Abstract: Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda=0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saves 14-23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieves 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
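The abstract describes a per-token gate: a small two-layer MLP reads each token's hidden state, emits a halting probability via a sigmoid, and that probability scales the residual update contributed by the next transformer block. A minimal NumPy sketch of this idea follows; the function names, the gate width, and the convention that the update is scaled by (1 - p_halt) are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def token_gate(h, W1, b1, W2, b2):
    """Two-layer MLP gate (hypothetical shapes): maps each token's
    hidden state to a halting probability in (0, 1).
    h: (seq_len, d_model), W1: (d_model, d_gate), W2: (d_gate, 1)."""
    z = np.maximum(h @ W1 + b1, 0.0)       # ReLU hidden layer
    logits = z @ W2 + b2                   # (seq_len, 1) per-token logit
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> halting probability

def gated_block(h, block_fn, gate_params):
    """Apply one transformer block with a per-token gate on its
    residual update. Tokens whose halting probability is near 1
    receive almost no update, i.e. they effectively skip the block."""
    p_halt = token_gate(h, *gate_params)           # (seq_len, 1)
    update = block_fn(h) - h                       # residual update of the block
    return h + (1.0 - p_halt) * update             # continuous, differentiable gate

# Illustrative usage with random parameters and a dummy block function.
rng = np.random.default_rng(0)
d_model, d_gate, seq_len = 8, 4, 5
params = (rng.standard_normal((d_model, d_gate)), np.zeros(d_gate),
          rng.standard_normal((d_gate, 1)), np.zeros(1))
h = rng.standard_normal((seq_len, d_model))
h_next = gated_block(h, lambda x: x + 1.0, params)
```

Because the gate is a smooth sigmoid rather than a hard binary decision, gradients of the task loss flow through it end-to-end, which is consistent with the abstract's claim that routing emerges even without an explicit depth penalty; at inference time, thresholding the probability would yield the sparse execution the abstract mentions.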