[2604.02292] Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Computer Science > Machine Learning
arXiv:2604.02292 (cs)
[Submitted on 2 Apr 2026]

Title: Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Authors: Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini

Abstract: Softmax can become a computational bottleneck in the Transformer's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. We therefore propose Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate for the exponential softmax that applies a clipped linear mapping to the max-centered attention logits. The approximation produces a stable probability distribution, preserves the ordering of the original logits, and yields non-negative values. HCCS differs from previous softmax surrogates in that it includes a set of lightweight calibration parameters, optimized offline on a representative dataset and tuned per attention head to preserve each head's statistical properties. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform...
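To make the clipped-linear idea concrete, here is a minimal sketch of one plausible instantiation, assuming a single per-head clip-width parameter and a simple grid-search calibration against exact softmax; the function names (`hccs_probs`, `calibrate_head`), the parameterization, and the calibration objective are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def hccs_probs(logits, clip_width):
    # Max-center so the largest logit maps to 0 (the stable-softmax trick).
    s = logits - logits.max(axis=-1, keepdims=True)
    # Clipped-linear map: 1 at the max, falling linearly to 0 for logits
    # more than clip_width below it. Bounded in [0, 1], monotone,
    # non-negative, and order-preserving.
    u = np.clip(1.0 + s / clip_width, 0.0, 1.0)
    # The max-centered entry is always 1, so the sum is >= 1 and the
    # normalization is numerically safe.
    return u / u.sum(axis=-1, keepdims=True)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def calibrate_head(cal_logits, grid=np.linspace(0.5, 16.0, 64)):
    # Offline per-head calibration: pick the clip width that best matches
    # exact softmax on representative logits (mean absolute error over a
    # simple grid here; the paper's objective and parameter set may differ).
    ref = softmax(cal_logits)
    errs = [np.abs(hccs_probs(cal_logits, c) - ref).mean() for c in grid]
    return grid[int(np.argmin(errs))]

# Example: calibrate one head on random "representative" logits.
rng = np.random.default_rng(0)
cal = rng.normal(scale=2.0, size=(256, 16))   # 256 logit rows, 16 keys
c_star = calibrate_head(cal)
print("calibrated clip width:", c_star)
print("max abs error vs softmax:",
      np.abs(hccs_probs(cal, c_star) - softmax(cal)).max())
```

Because the map is piecewise linear in the max-centered logits, it avoids exponentiation entirely, which is what makes it attractive for integer-native, low-precision inference.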