[2510.13860] ShishuLM : Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models
Computer Science > Computation and Language
arXiv:2510.13860 (cs)
[Submitted on 13 Oct 2025 (v1), last revised 31 Mar 2026 (this version, v2)]

Title: ShishuLM : Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models
Authors: Shivanshu Kumar, Gopalakrishnan Srinivasan

Abstract: While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancy within these models, particularly in the attention sub-layers of the top layers, presenting opportunities for optimization without compromising performance. Drawing on insights from research on inference-time layer pruning and depth-dependent computation in language models, we introduce an efficient language model architecture referred to as ShishuLM. By replacing full decoder layers at the top of the model with MLP-only blocks, we achieve a 10-60% improvement in generation latency and a 1.3-5$\times$ gain in throughput. By further sharing parameters across adjacent MLP-only layers of ShishuLM, we obtain up to 20% savings in memory with minimal degradation in performance. Our findings provide insights towards building more effic...
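The abstract describes two ideas: replacing the top decoder layers of a transformer with MLP-only blocks, and sharing parameters across adjacent MLP-only layers. A minimal PyTorch sketch of that layout is below; all class names, layer counts, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the architecture sketched in the abstract:
# full decoder blocks (attention + MLP) at the bottom, MLP-only blocks
# on top, with adjacent MLP-only blocks sharing parameters.
# Names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """Pre-norm feed-forward block with a residual connection."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))


class DecoderBlock(nn.Module):
    """Standard pre-norm decoder block: self-attention followed by an MLP."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = MLPBlock(d_model, d_ff)

    def forward(self, x):
        h = self.norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        return self.mlp(x + a)


class LowAttentionLM(nn.Module):
    """n_full decoder blocks at the bottom, n_top MLP-only blocks on top.

    Each adjacent pair of MLP-only blocks reuses one module, so the top
    of the model stores roughly half the MLP parameters it would otherwise.
    """

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, n_full=4, n_top=4):
        super().__init__()
        self.lower = nn.ModuleList(
            DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_full)
        )
        shared = [MLPBlock(d_model, d_ff) for _ in range((n_top + 1) // 2)]
        # The same module object appears twice in the list: parameter sharing.
        self.upper = nn.ModuleList(shared[i // 2] for i in range(n_top))

    def forward(self, x):
        for blk in self.lower:
            x = blk(x)
        for blk in self.upper:
            x = blk(x)
        return x
```

A quick check of the sharing: constructing `LowAttentionLM(n_top=4)` yields four top-level MLP applications backed by only two distinct parameter sets, which is where the memory saving comes from in this sketch.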