[2604.02324] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Computer Science > Computation and Language

arXiv:2604.02324 (cs) [Submitted on 2 Apr 2026]

Title: Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning b...
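
The abstract's critique can be made concrete. Below is a minimal sketch (not the authors' code) of the standard practice described above: extending a Hugging Face causal LM with new Semantic-ID tokens, initializing them to the mean of the existing vocabulary embeddings, and running a simple spectral diagnostic on the result. The model name and the <sid_i> token format are illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical Semantic-ID vocabulary for generative recommendation.
new_tokens = [f"<sid_{i}>" for i in range(256)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight  # shape: (V + 256, d)
n_new = len(new_tokens)

# Standard practice: every new token starts at the mean of the
# pretrained vocabulary embeddings -- i.e., at the same point.
with torch.no_grad():
    old_mean = emb[:-n_new].mean(dim=0)
    emb[-n_new:] = old_mean

# Spectral diagnostic: singular values of the centered new-token block.
# Identical rows center to the zero matrix, so every singular value is
# exactly zero -- the new tokens occupy a degenerate (rank-0) subspace
# with no inter-token distinctions for fine-tuning to build on.
new_block = emb[-n_new:].detach()
centered = new_block - new_block.mean(dim=0)
print("top singular values:", torch.linalg.svdvals(centered)[:5])
```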
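
For contrast, here is one hedged reading of what "linguistically grounding" a new token could look like, continuing the sketch above: seed each Semantic-ID token from the pretrained embeddings of a short textual description of the item it denotes, rather than from the vocabulary mean. The description table and helper function are assumptions made for illustration, not the paper's exact procedure.

```python
def grounded_init(description: str) -> torch.Tensor:
    """Mean of the pretrained subword embeddings of a textual description."""
    ids = tokenizer(description, add_special_tokens=False)["input_ids"]
    return emb[ids].mean(dim=0)

# Hypothetical item descriptions, one per Semantic-ID token.
item_descriptions = {
    "<sid_0>": "wireless noise-cancelling over-ear headphones",
    "<sid_1>": "stainless steel insulated water bottle",
}

with torch.no_grad():
    for tok, desc in item_descriptions.items():
        row = tokenizer.convert_tokens_to_ids(tok)
        emb[row] = grounded_init(desc)
```

Under this kind of initialization the new-token block inherits the spread of the pretrained embedding space, so the same centered-SVD diagnostic yields nonzero singular values: the tokens start out distinguishable rather than collapsed onto a single point.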