[2604.04599] LP-GEMM: Integrating Layout Propagation into GEMM Operations
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2604.04599 (cs)
[Submitted on 6 Apr 2026]

Title: LP-GEMM: Integrating Layout Propagation into GEMM Operations
Authors: César Guedes Carneiro, Lucas Alvarenga, Guido Araujo, Sandro Rigo

Abstract: In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. ...
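The packing-propagation idea in the abstract can be illustrated with a toy NumPy sketch. This is not the paper's implementation: LP-GEMM decomposes real BLAS microkernels, whereas the `pack_cols`, `gemm_keep_packed`, and `gemm_consume_packed` names and the column-panel "packed" layout below are illustrative assumptions. The point shown is only the data-flow claim: the first GEMM can leave its output in a packed layout, and the next GEMM can consume that layout directly, skipping the unpack/repack round trip while producing the same result as the canonical computation.

```python
import numpy as np

def pack_cols(M, block):
    # Toy "packing": split a matrix into contiguous column panels,
    # standing in for the cache-friendly layout a BLAS pack step builds.
    return [np.ascontiguousarray(M[:, j:j + block])
            for j in range(0, M.shape[1], block)]

def gemm_keep_packed(A, B, block):
    # First GEMM in the chain: compute A @ B but leave the result in
    # packed (column-panel) form instead of restoring canonical layout.
    return pack_cols(A @ B, block)

def gemm_consume_packed(panels, B, block):
    # Second GEMM: consume the packed left operand directly.
    # Panel j covers the k-range [j*block, (j+1)*block), so
    # C = sum_j panels[j] @ B[j*block:(j+1)*block, :].
    C = np.zeros((panels[0].shape[0], B.shape[1]))
    for j, p in enumerate(panels):
        C += p @ B[j * block:(j + 1) * block, :]
    return C

rng = np.random.default_rng(0)
n, block = 8, 4
X, W1, W2 = (rng.standard_normal((n, n)) for _ in range(3))

# MLP-like chain Y2 = (X @ W1) @ W2 with the intermediate kept packed:
Y1_packed = gemm_keep_packed(X, W1, block)          # skip the unpack
Y2 = gemm_consume_packed(Y1_packed, W2, block)      # skip the repack
assert np.allclose(Y2, (X @ W1) @ W2)               # BLAS semantics preserved
```

In a real BLAS chain, each call would unpack `Y1` to row-major storage and the next call would immediately repack it; keeping the intermediate in panel form is the redundancy the paper targets.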