[2602.22647] Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Summary
The paper presents STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an approach that makes constrained decoding for LLM-based generative retrieval efficient and scalable on hardware accelerators.
Why It Matters
As generative retrieval becomes central to recommendation systems, its efficiency on accelerators such as TPUs and GPUs is critical. STATIC addresses the latency penalties that Trie-based constrained decoding incurs on this hardware, enabling faster recommendations with direct impact on user experience and business outcomes.
Key Takeaways
- STATIC flattens the prefix tree (Trie) into a Compressed Sparse Row (CSR) matrix, turning irregular tree traversals into vectorized operations.
- Achieves up to 948x speedup over traditional CPU implementations.
- Maintains minimal latency overhead, crucial for real-time applications.
- First production-scale deployment of constrained generative retrieval.
- Improves cold-start performance significantly in generative retrieval tasks.
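The CSR flattening in the first takeaway can be sketched in a few lines of numpy. This is a minimal illustration under assumed conventions (node 0 as the root, edges sorted by token id, a made-up toy vocabulary), not the paper's implementation:

```python
import numpy as np

def build_trie(sequences):
    """Build a pointer-based prefix tree; node 0 is the root.
    children[n] maps a token id to a child node id."""
    children = [{}]
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children.append({})
                children[node][tok] = len(children) - 1
            node = children[node][tok]
    return children

def trie_to_csr(children):
    """Flatten the trie into CSR arrays: node n's outgoing edges are
    tokens[indptr[n]:indptr[n+1]] -> child_ids[indptr[n]:indptr[n+1]]."""
    indptr, tokens, child_ids = [0], [], []
    for edges in children:
        for tok, child in sorted(edges.items()):
            tokens.append(tok)
            child_ids.append(child)
        indptr.append(len(tokens))
    return np.asarray(indptr), np.asarray(tokens), np.asarray(child_ids)

# Three item ids encoded as token sequences (hypothetical toy data).
items = [[5, 2, 7], [5, 2, 9], [5, 3, 1]]
indptr, tokens, child_ids = trie_to_csr(build_trie(items))
print(tokens[indptr[0]:indptr[1]])  # prints [5]: the root's only legal first token
```

Once flattened, the three dense arrays are static, so they can live in accelerator memory and be indexed with fixed-shape gathers instead of pointer chasing.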
Computer Science > Information Retrieval · arXiv:2602.22647 (cs) · Submitted on 26 Feb 2026

Authors: Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han

Abstract: Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized s...
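The abstract's claim that CSR flattening turns tree traversal into vectorized operations can be illustrated with a small numpy sketch of the per-step legality mask. The CSR arrays and vocabulary below are hypothetical toy values, not taken from the paper; the point is that the lookup uses only fixed-shape gather/scatter, the kind of indexing that maps well to TPUs/GPUs:

```python
import numpy as np

# Hypothetical CSR arrays for a toy 3-node Trie over a 10-token
# vocabulary; node n's edges are tokens[indptr[n]:indptr[n+1]].
indptr = np.array([0, 1, 3, 5])
tokens = np.array([5, 2, 3, 7, 9])
VOCAB_SIZE = 10

def constrained_logits(logits, nodes):
    """Mask a batch of logits so only Trie-legal tokens stay finite.

    Rather than walking pointer-based tree nodes per sequence, we
    gather each row's CSR slice with dense, fixed-shape indexing."""
    nodes = np.asarray(nodes)
    degree = indptr[1:] - indptr[:-1]
    max_deg = int(degree.max())
    # Fixed-shape [batch, max_deg] edge offsets; padding slots are
    # clipped to a valid index and then masked out via `valid`.
    offs = indptr[nodes, None] + np.arange(max_deg)[None, :]
    valid = np.arange(max_deg)[None, :] < degree[nodes, None]
    gathered = tokens[np.minimum(offs, len(tokens) - 1)]
    allowed = np.zeros((len(nodes), VOCAB_SIZE), dtype=bool)
    batch_idx = np.broadcast_to(np.arange(len(nodes))[:, None], gathered.shape)
    allowed[batch_idx[valid], gathered[valid]] = True
    return np.where(allowed, logits, -np.inf)

logits = np.zeros((2, VOCAB_SIZE))
out = constrained_logits(logits, nodes=[0, 1])
print(np.isfinite(out[0]).nonzero()[0])  # prints [5]: node 0 permits only token 5
```

Because every array here has a static shape, the same pattern compiles cleanly under accelerator toolchains (e.g. XLA), which cannot efficiently trace the data-dependent branching of a pointer-based Trie walk.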