[2602.12601] HyperMLP: An Integrated Perspective for Sequence Modeling
Summary
The paper presents HyperMLP, a novel approach to sequence modeling that reinterprets autoregressive attention as a dynamic two-layer MLP, enhancing performance over traditional softmax-attention methods.
Why It Matters
This research offers a fresh perspective on sequence modeling, potentially improving efficiency and effectiveness in machine learning applications. By proposing HyperMLP and HyperGLU, the authors challenge existing paradigms and provide empirical evidence for their approach, which could influence future developments in AI and NLP.
Key Takeaways
- HyperMLP redefines autoregressive attention as a dynamic MLP.
- The proposed models outperform strong softmax-attention baselines under matched parameter budgets.
- Theoretical characterizations provide insights into model expressivity.
- Dynamic mixing in feature and sequence space enhances performance.
- The approach aligns temporal mixing with autoregressive semantics.
Computer Science > Machine Learning — arXiv:2602.12601 (cs)
[Submitted on 13 Feb 2026]
Title: HyperMLP: An Integrated Perspective for Sequence Modeling
Authors: Jiecheng Lu, Shihao Yang
Abstract: Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Mac...
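The abstract's core reading can be made concrete with a minimal sketch: for a single query, attention output equals a two-layer MLP whose first-layer weights are the stacked keys and whose second-layer weights are the stacked values, with the normalization (or activation) applied to the hidden layer of scores. The shapes, variable names, and the ReLU variant below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4   # head dimension (assumed for illustration)
T = 6   # context length so far; the hidden layer grows with T

K = rng.standard_normal((T, d))   # keys from the context history
V = rng.standard_normal((T, d))   # values from the context history
q = rng.standard_normal(d)        # current query

# Standard softmax attention: scores normalized into a probability distribution.
scores = q @ K.T                      # (T,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
out_softmax = probs @ V               # (d,)

# Same computation read as a dynamic two-layer MLP instantiated from context:
# W1 = K.T (input -> T-dim hidden), W2 = V (hidden -> output), with softmax
# playing the role of the hidden activation.
hidden = q @ K.T                      # hidden representation = raw scores
out_mlp = probs @ V                   # identical to out_softmax

# Swapping softmax for a standard MLP activation such as ReLU turns the hidden
# layer into input-conditioned selection over a context-dependent memory pool
# rather than a probability distribution.
out_relu = np.maximum(hidden, 0.0) @ V

assert np.allclose(out_softmax, out_mlp)
```

Under this view, growing the context by one token appends one row to both `K` and `V`, i.e., widens the MLP's hidden layer by one unit, which is what the abstract means by an "ever-growing hidden representation".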