[2603.08343] Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Computer Science > Machine Learning
arXiv:2603.08343 (cs)
[Submitted on 9 Mar 2026 (v1), last revised 30 Mar 2026 (this version, v2)]

Title: Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Authors: Shubham Aggarwal, Lokendra Kumar

Abstract: The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25 percent of attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT-augmented models exhibit a steeper validation-loss curve relative to training FLOPs compared to dense baselines, suggesting superior compute utilization during training. Crucially, we show that efficiency gains, including reduced memory footprint and increased throughput, grow monotonically with model size, batch size, and sequence length. We evaluate performance across both prefill and decoding stages, finding that the structured transform consistently outperforms...
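The sketch below illustrates the idea described in the abstract: replacing the dense output projection W_O of multi-head attention with a fixed Walsh-Hadamard transform followed by a learned per-feature (diagonal) scale and bias. It is a minimal PyTorch interpretation based only on the abstract; the module name, the sqrt(d) normalization that makes the transform orthonormal, and the placement of the scale and bias are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dimension.

    The last dimension must be a power of two. Dividing by sqrt(d) makes the
    transform orthonormal, i.e. norm-preserving as the abstract claims.
    Cost is O(d log d) per token instead of O(d^2) for a dense projection.
    """
    orig_shape = x.shape
    d = orig_shape[-1]
    assert d & (d - 1) == 0, "feature dimension must be a power of two"
    x = x.reshape(-1, d)
    h = 1
    while h < d:
        # Butterfly step: within each block of 2h features, combine the two halves.
        x = x.view(-1, d // (2 * h), 2, h)
        a, b = x[:, :, 0, :], x[:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, d)
        h *= 2
    return (x / d ** 0.5).view(orig_shape)


class HadamardOutputProjection(nn.Module):
    """Hypothetical drop-in replacement for the dense attention output projection.

    Mixes the concatenated head outputs with a fixed, parameter-free WHT and
    then applies a learned diagonal affine map (per-feature scale and bias),
    reducing the projection's parameters from d_model^2 to 2 * d_model.
    """

    def __init__(self, d_model: int):
        super().__init__()
        assert d_model & (d_model - 1) == 0, "d_model must be a power of two"
        self.scale = nn.Parameter(torch.ones(d_model))  # diagonal weight
        self.bias = nn.Parameter(torch.zeros(d_model))  # diagonal bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), the concatenated per-head attention outputs.
        return fwht(x) * self.scale + self.bias


if __name__ == "__main__":
    proj = HadamardOutputProjection(d_model=512)
    heads_out = torch.randn(2, 128, 512)  # (batch, seq_len, d_model)
    print(proj(heads_out).shape)          # torch.Size([2, 128, 512])
```

Under these assumptions, the parameter saving quoted in the abstract follows directly: of the four d_model x d_model attention projections (Q, K, V, output), the output projection is removed, which is roughly 25 percent of the attention parameters per block.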