[2602.19762] Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)
Summary
Hexagon-MLIR is an open-source compilation stack for Qualcomm's NPUs that accelerates AI workloads by compiling Triton kernels and PyTorch models down to NPU binaries.
Why It Matters
This development is significant as it provides a flexible, open-source solution for AI compilation, addressing the need for efficient deployment of AI models on specialized hardware. By leveraging the MLIR framework, it aims to reduce bandwidth bottlenecks and improve performance for developers working with Qualcomm's NPUs.
Key Takeaways
- Hexagon-MLIR optimizes AI workloads for Qualcomm's NPUs.
- It supports automated compilation from Triton kernels to binary.
- The framework improves data locality in the NPU's Tightly Coupled Memory (TCM), reducing bandwidth bottlenecks.
- Developers gain a flexible tool for advancing AI compilation capabilities.
- The project is ongoing, with plans for further optimizations.
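The mega-kernel idea rests on keeping intermediates in fast local memory (the TCM) instead of spilling them to DRAM between separate kernels. A minimal NumPy sketch of that fusion pattern, purely illustrative of the concept (the function name and tiling scheme are invented here, not the stack's actual API):

```python
import numpy as np

def fused_tiled_bias_relu(x, w, b, tile=64):
    """Illustrative fusion: matmul + bias + ReLU computed one tile at a
    time, so each intermediate tile stays "hot" in fast local memory (a
    stand-in for TCM) rather than round-tripping through DRAM between
    three separate kernels."""
    m, k = x.shape
    k2, n = w.shape
    assert k == k2, "inner dimensions must match"
    out = np.empty((m, n), dtype=x.dtype)
    for i in range(0, m, tile):
        rows = slice(i, min(i + tile, m))
        t = x[rows] @ w          # tile-sized intermediate
        t += b                   # bias added while the tile is local
        np.maximum(t, 0, out=t)  # ReLU in place, no extra memory pass
        out[rows] = t
    return out
```

An unfused version would materialize the full `x @ w` product, then re-read it to add the bias, then re-read it again for the ReLU; the fused loop touches each output element once, which is the bandwidth saving the paper attributes to its mega-kernels.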
Computer Science > Programming Languages
arXiv:2602.19762 (cs) · [Submitted on 23 Feb 2026]
Authors: Mohammed Javed Absar, Muthu Baskaran, Abhikrant Sharma, Abhilash Bhandari, Ankit Aggarwal, Arun Rangasamy, Dibyendu Das, Fateme Hosseini, Franck Slama, Iulian Brumar, Jyotsna Verma, Krishnaprasad Bindumadhavan, Mitesh Kothari, Mohit Gupta, Ravishankar Kolachana, Richard Lethin, Samarth Narang, Sanjay Motilal Ladwa, Shalini Jain, Snigdha Suresh Dalvi, Tasmia Rahman, Venkat Rasagna Reddy Komatireddy, Vivek Vasudevbhai Pandya, Xiyue Shi, Zachary Zipper
Abstract: In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built using the MLIR framework, our compiler applies a structured sequence of passes that exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0) for our target by providing automated compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), r...