[2602.12962] TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design
Summary
The paper presents TriGen, an NPU architecture that accelerates large language models (LLMs) end to end through software-hardware co-design. It targets resource-constrained on-device environments and reports an average 2.73x speedup.
Why It Matters
As large language models become increasingly prevalent, optimizing their performance on resource-limited devices is crucial. TriGen addresses this challenge by enhancing computational efficiency and reducing memory transfer, making it relevant for developers and researchers in AI hardware.
Key Takeaways
- TriGen achieves an average 2.73x speedup in performance for LLMs.
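- Uses low-precision computation based on microscaling (MX) to open up further optimization opportunities while preserving accuracy.
- Eliminates the need for specialized hardware for essential nonlinear operations by using a fast and accurate lookup table (LUT), reducing costs.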
- Implements scheduling techniques to maximize computational utilization under memory constraints.
- Demonstrates significant reductions in memory transfer, enhancing efficiency.
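To make the low-precision takeaway more concrete, here is a minimal sketch of block-wise quantization in the spirit of microscaling (MX): each block of values shares one power-of-two scale, and each element is stored as a small integer. This is not TriGen's implementation. The block size of 32, the signed 8-bit element type, and all function names are assumptions chosen for illustration, and the encoding is simplified relative to the published MX formats.

```python
import math

BLOCK_SIZE = 32  # assumption: elements per shared scale
ELEM_BITS = 8    # assumption: signed 8-bit integer elements


def mx_quantize_block(values):
    """Quantize one block to (shared_exponent, list_of_ints)."""
    qmax = 2 ** (ELEM_BITS - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps every element within [-qmax, qmax].
    shared_exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** shared_exp
    # The clamp guards against floating-point edge cases in the log2 above.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, quantized


def mx_dequantize_block(shared_exp, quantized):
    """Recover approximate real values from one quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def mx_quantize(values):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        mx_quantize_block(values[start:start + BLOCK_SIZE])
        for start in range(0, len(values), BLOCK_SIZE)
    ]


def mx_dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    out = []
    for shared_exp, quantized in blocks:
        out.extend(mx_dequantize_block(shared_exp, quantized))
    return out
```

Because the scale is shared per block rather than per tensor, a single outlier only degrades the precision of the 32 values around it. That is one reason block-scaled formats can hold accuracy at low bit widths.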
Computer Science > Hardware Architecture
arXiv:2602.12962 (cs)
[Submitted on 13 Feb 2026]
Title: TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design
Authors: Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, Heonjae Ha
Abstract: Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant, with rapidly increasing model sizes but low degree of parameter reuse compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. Firstly, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise by employing such precision. Secondly, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized hardware for essential nonlinear operations by using fast and accurate LUT, th...
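The abstract mentions replacing specialized nonlinear hardware with a LUT, but the text available here does not describe how that table is built. As a generic illustration only, the sketch below approximates a nonlinear function with a uniformly sampled lookup table and linear interpolation between neighboring entries. The function names, the sigmoid example, the input range, and the table size are all assumptions and are not taken from the paper.

```python
import math


def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]
    return table, step


def lut_eval(table, step, lo, x):
    """Approximate fn(x) by linear interpolation between table entries."""
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)            # saturate inputs outside the table range
    pos = (x - lo) / step              # fractional index into the table
    i = min(int(pos), len(table) - 2)  # keep i + 1 inside the table
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical configuration: 257 entries covering [-8, 8].
SIGMOID_LO = -8.0
SIGMOID_TABLE, SIGMOID_STEP = build_lut(sigmoid, SIGMOID_LO, 8.0, 257)


def sigmoid_lut(x):
    return lut_eval(SIGMOID_TABLE, SIGMOID_STEP, SIGMOID_LO, x)
```

A table like this trades a small amount of memory for the ability to evaluate the function with only a lookup, a subtract, and a multiply-add, which are the kinds of operations a linear-algebra datapath already provides.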