[2604.04384] Compressible Softmax-Attended Language under Incompressible Attention
Computer Science > Computation and Language
arXiv:2604.04384 (cs) [Submitted on 6 Apr 2026]

Title: Compressible Softmax-Attended Language under Incompressible Attention
Authors: Wonsuk Lee

Abstract: Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance in 2--11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 components for the same threshold out of $d_h \in \{64, 128\}$, a gap of $5$--$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not of the frame that analyzes it.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
MSC classes: 68T01
ACM classes: I.2.0
Cite as: arXiv:2604.04384 [cs.CL] (or arXiv:2604.04384v1 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2604.04384 (arXiv-issued DOI via DataCite, pending registration)
Submission history: [v1] Mon, 6 Apr 2026 03:18:27 UTC (8 KB), from Wonsuk Lee
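The effective-rank comparison in the abstract amounts to counting how many singular components of each matrix are needed to reach 90% of its total variance. The sketch below illustrates that measurement; it is not the paper's code. It assumes NumPy, the helper `components_for_variance` and the synthetic matrices are hypothetical, and the random interaction matrix and low-rank energy surrogate merely stand in for the learned $W_Q^\mathrm{T} W_K$ and the data-derived field $\tilde{E}$.

```python
import numpy as np


def components_for_variance(M: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest k such that the top-k singular values capture `threshold`
    of the total variance (sum of squared singular values)."""
    s = np.linalg.svd(M, compute_uv=False)
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cum, threshold) + 1)


# Toy setup for a single attention head with d_h = 64 (one of the two head
# sizes mentioned in the abstract); d_model and the seed are arbitrary.
d_h, d_model = 64, 768
rng = np.random.default_rng(0)

# Hypothetical learned projections; the d_h x d_h interaction matrix
# W_Q^T W_K has at most d_h nonzero singular values.
W_Q = rng.standard_normal((d_model, d_h)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_h)) / np.sqrt(d_model)
interaction = W_Q.T @ W_K

# Low-rank surrogate for the data-dependent energy field: the paper's
# \tilde{E} is estimated from actual token representations, which this
# rank-4 stand-in only mimics to make the contrast visible.
U = rng.standard_normal((d_h, 4))
energy = U @ U.T

print("interaction matrix:", components_for_variance(interaction), "components for 90% variance")
print("energy field      :", components_for_variance(energy), "components for 90% variance")
```

With a full-rank random interaction matrix and a rank-4 surrogate, the printed counts differ by roughly an order of magnitude, the same kind of spectral gap the abstract reports between the learned weights and the data.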