[2604.00235] MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
Computer Science > Machine Learning

arXiv:2604.00235 (cs)

[Submitted on 31 Mar 2026]

Title: MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Authors: Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda

Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations reduce the bytes read either by compression, which lowers fidelity, or by selection/eviction, which restricts what remains accessible; both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. A match stage performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified result with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBe...
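The abstract only names the three stages, but the match test and the "numerically stable merge" of the complete stage can be sketched concretely. The following is a minimal NumPy illustration, not the paper's implementation: the function names, the threshold tau, and the prefix/tail split are assumptions, and the merge shown is the standard log-sum-exp fusion of partial softmax results over disjoint KV segments, which is one way such a merge can be realized exactly.

```python
import numpy as np

def match(q_pre_rope, recent_queries, tau):
    """Match stage (sketch): return the index of the closest recent pre-RoPE
    query within L2 distance tau of the current one, or None on a miss."""
    if len(recent_queries) == 0:
        return None
    dists = np.linalg.norm(recent_queries - q_pre_rope, axis=-1)
    i = int(np.argmin(dists))
    return i if dists[i] <= tau else None

def partial_attention(q, K, V):
    """Attention of one query over one KV segment, returned un-normalized
    together with the softmax statistics (row max m, normalizer s) needed
    to merge it exactly with another segment later."""
    scores = K @ q / np.sqrt(q.shape[-1])      # (n,)
    m = scores.max()
    w = np.exp(scores - m)                     # stabilized weights
    s = w.sum()
    o = w @ V                                  # un-normalized output, shape (d,)
    return o, m, s

def merge(o1, m1, s1, o2, m2, s2):
    """Complete stage (sketch): log-sum-exp merge of two partial attention
    results computed over disjoint KV segments; exact, not an approximation."""
    m = max(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return (a1 * o1 + a2 * o2) / (a1 * s1 + a2 * s2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_prefix, n_tail = 64, 1024, 16         # hypothetical sizes
    q = rng.standard_normal(d)
    K = rng.standard_normal((n_prefix + n_tail, d))
    V = rng.standard_normal((n_prefix + n_tail, d))

    # Reference: exact attention over the full cache.
    o, _, s = partial_attention(q, K, V)
    ref = o / s

    # Split into a "reused/amended" prefix and a fresh KV tail, then fuse.
    o1, m1, s1 = partial_attention(q, K[:n_prefix], V[:n_prefix])
    o2, m2, s2 = partial_attention(q, K[n_prefix:], V[n_prefix:])
    assert np.allclose(merge(o1, m1, s1, o2, m2, s2), ref)
```

The self-check only verifies that merging per-segment partials reproduces full attention exactly. In the scheme the abstract describes, the prefix partial would presumably come from the cached computation of the matched query, corrected by the amend stage near the match boundary, rather than being recomputed as it is in this demonstration.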