[2603.08343] Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Computer Science > Machine Learning
arXiv:2603.08343 (cs)
[Submitted on 9 Mar 2026 (v1), last revised 30 Mar 2026 (this version, v2)]

Title: Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Authors: Shubham Aggarwal, Lokendra Kumar

Abstract: The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25 percent of attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT-augmented models exhibit a steeper validation-loss curve relative to training FLOPs compared to dense baselines, suggesting superior compute utilization during training. Crucially, we show that efficiency gains, including reduced memory footprint and increased throughput, grow monotonically with model size, batch size, and sequence length. We evaluate performance across both prefill and decoding stages, finding that the structured transform consistently outperforms...
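The sketch below illustrates the idea described in the abstract: replacing the dense output projection W_O of multi-head attention with a fixed Walsh-Hadamard transform followed by a learned per-feature (diagonal) scale and bias. It is a minimal PyTorch interpretation based only on the abstract; the module name, the sqrt(d) normalization that makes the transform orthonormal, and the placement of the scale and bias are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dimension.

    The last dimension must be a power of two. Dividing by sqrt(d) makes the
    transform orthonormal, i.e. norm-preserving as the abstract claims.
    Cost is O(d log d) per token instead of O(d^2) for a dense projection.
    """
    orig_shape = x.shape
    d = orig_shape[-1]
    assert d & (d - 1) == 0, "feature dimension must be a power of two"
    x = x.reshape(-1, d)
    h = 1
    while h < d:
        # Butterfly step: within each block of 2h features, combine the two halves.
        x = x.view(-1, d // (2 * h), 2, h)
        a, b = x[:, :, 0, :], x[:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, d)
        h *= 2
    return (x / d ** 0.5).view(orig_shape)


class HadamardOutputProjection(nn.Module):
    """Hypothetical drop-in replacement for the dense attention output projection.

    Mixes the concatenated head outputs with a fixed, parameter-free WHT and
    then applies a learned diagonal affine map (per-feature scale and bias),
    reducing the projection's parameters from d_model^2 to 2 * d_model.
    """

    def __init__(self, d_model: int):
        super().__init__()
        assert d_model & (d_model - 1) == 0, "d_model must be a power of two"
        self.scale = nn.Parameter(torch.ones(d_model))  # diagonal weight
        self.bias = nn.Parameter(torch.zeros(d_model))  # diagonal bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), the concatenated per-head attention outputs.
        return fwht(x) * self.scale + self.bias


if __name__ == "__main__":
    proj = HadamardOutputProjection(d_model=512)
    heads_out = torch.randn(2, 128, 512)  # (batch, seq_len, d_model)
    print(proj(heads_out).shape)          # torch.Size([2, 128, 512])
```

Under these assumptions, the parameter saving quoted in the abstract follows directly: of the four d_model x d_model attention projections (Q, K, V, output), the output projection is removed, which is roughly 25 percent of the attention parameters per block.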