[2604.27747] Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
arXiv:2604.27747 (cs) [Submitted on 30 Apr 2026]

Computer Science > Information Retrieval

Title: Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

Authors: Jiaju Chen, Chongming Gao, Chenxiao Fan, Haoyan Liu, Qingpeng Cai, Peng Jiang, Xiangnan He

Abstract: Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the ...
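The draft-then-verify loop the abstract describes can be illustrated with a minimal sketch. The toy `draft_model` and `target_model` functions below are hypothetical stand-ins for real LLMs, and greedy decoding is assumed for simplicity; the sketch only shows the accept-longest-prefix mechanics, not the paper's position-aware drafting.

```python
# Minimal speculative-decoding (SD) verification sketch: the draft model
# proposes k next tokens at once, and the target model accepts the longest
# agreeing prefix, then supplies one corrected token. The two models here
# are hypothetical lookup tables, not the paper's actual components.

def draft_model(context):
    # Hypothetical cheap draft: accurate early, wrong at deeper slots.
    table = {(): "a", ("a",): "b", ("a", "b"): "x"}
    return table.get(tuple(context), "?")

def target_model(context):
    # Hypothetical target: the reference next-token predictor.
    table = {(): "a", ("a",): "b", ("a", "b"): "c"}
    return table.get(tuple(context), "?")

def speculative_step(context, k=3):
    """Propose k draft tokens, accept the longest prefix the target
    agrees with, then append one token from the target itself."""
    # Drafting: roll the cheap model forward k steps.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    # Verification: accept until the target's greedy choice disagrees.
    accepted, ctx = [], list(context)
    for tok in draft:
        if target_model(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # The target supplies the next token on mismatch (or after full
    # acceptance), so every round emits at least one target-quality token.
    accepted.append(target_model(ctx))
    return accepted

print(speculative_step([]))  # -> ['a', 'b', 'c']
```

One round here emits three tokens ("a", "b" accepted, "c" corrected) for a single verification pass, which is the source of SD's speedup; the paper's observation is that acceptance rates like this degrade with speculation depth and within-item slot unless the draft models those positions.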