[2604.04440] Training Transformers in Cosine Coefficient Space
arXiv:2604.04440 [cs.PF]
Computer Science > Performance
[Submitted on 6 Apr 2026]

Title: Training Transformers in Cosine Coefficient Space
Authors: Mohamed Amine Bergach

Abstract: We parameterize the weight matrices of a transformer in the two-dimensional discrete cosine transform (DCT) domain, retaining only the lowest-frequency coefficients. At each forward pass the full weight matrix is reconstructed via the inverse DCT; gradients propagate through the reconstruction to update the spectral coefficients directly. On character-level language modeling (Shakespeare, 1M characters), a 4-layer transformer trained from scratch in this representation matches the perplexity of the standard parameterization (6.1 vs.\ 6.1) while storing 52\% of the parameters. At 4$\times$ compression (29\% of parameters), the model reaches perplexity 6.9 -- outperforming a low-rank baseline (perplexity 8.8 at 21\% of parameters) at a comparable reduction. The method requires no architectural changes, no pre-trained checkpoint, and no auxiliary loss. It reduces to replacing each linear layer with a drop-in spectral layer that stores $K$ DCT coefficients instead of $n \times m$ weights.

Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.04440 [cs.PF] (or arXiv:2604.04440v1 [cs.PF] for this version)
DOI: https://doi.org/10.48550/ar...
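The core mechanism the abstract describes -- storing only the $k_n \times k_m$ lowest-frequency 2D DCT coefficients and reconstructing the full $n \times m$ weight matrix on each forward pass -- can be sketched as below. This is a minimal numpy illustration, not the paper's code: the class name `SpectralLinear`, the coefficient shapes, and the initialization scale are assumptions; in practice the coefficients would be framework parameters (e.g. a PyTorch tensor) so that gradients flow through the reconstruction automatically.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix of shape (n, n):
    # row k is the k-th cosine basis vector, so C @ C.T == I.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)  # DC row rescaled for orthonormality
    return C

class SpectralLinear:
    """Linear layer whose (n, m) weight matrix is stored as its
    k_n x k_m lowest-frequency 2D DCT coefficients (an assumption-laden
    sketch of the drop-in spectral layer described in the abstract)."""

    def __init__(self, n, m, k_n, k_m, seed=0):
        rng = np.random.default_rng(seed)
        self.n, self.m = n, m
        self.Cn = dct_matrix(n)  # (n, n) basis for rows
        self.Cm = dct_matrix(m)  # (m, m) basis for columns
        # The only trainable parameters: k_n * k_m spectral coefficients
        # instead of n * m weights (init scale is illustrative).
        self.coeff = rng.standard_normal((k_n, k_m)) * 0.02

    def weight(self):
        # Inverse 2D DCT of the zero-padded coefficient block:
        # W = Cn^T @ pad(A) @ Cm, valid because the basis is orthonormal
        # (C^{-1} = C^T). High-frequency coefficients are implicitly zero.
        A = np.zeros((self.n, self.m))
        A[: self.coeff.shape[0], : self.coeff.shape[1]] = self.coeff
        return self.Cn.T @ A @ self.Cm

    def __call__(self, x):
        # Reconstruct the full weight matrix, then apply it as usual.
        return x @ self.weight().T
```

At 4$\times$ compression one would pick $k_n k_m \approx 0.25\, n m$ (e.g. halving both spectral dimensions); training then updates `coeff` through the reconstruction, exactly as a standard linear layer updates its dense weights.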