[2604.02324] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Computer Science > Computation and Language

arXiv:2604.02324 (cs) [Submitted on 2 Apr 2026]

Title: Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning b...
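
The abstract's critique can be made concrete. Below is a minimal sketch (not the authors' code) of the standard practice described above: extending a Hugging Face causal LM with new Semantic-ID tokens, initializing them to the mean of the existing vocabulary embeddings, and running a simple spectral diagnostic on the result. The model name and the <sid_i> token format are illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical Semantic-ID vocabulary for generative recommendation.
new_tokens = [f"<sid_{i}>" for i in range(256)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight  # shape: (V + 256, d)
n_new = len(new_tokens)

# Standard practice: every new token starts at the mean of the
# pretrained vocabulary embeddings -- i.e., at the same point.
with torch.no_grad():
    old_mean = emb[:-n_new].mean(dim=0)
    emb[-n_new:] = old_mean

# Spectral diagnostic: singular values of the centered new-token block.
# Identical rows center to the zero matrix, so every singular value is
# exactly zero -- the new tokens occupy a degenerate (rank-0) subspace
# with no inter-token distinctions for fine-tuning to build on.
new_block = emb[-n_new:].detach()
centered = new_block - new_block.mean(dim=0)
print("top singular values:", torch.linalg.svdvals(centered)[:5])
```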
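
For contrast, here is one hedged reading of what "linguistically grounding" a new token could look like, continuing the sketch above: seed each Semantic-ID token from the pretrained embeddings of a short textual description of the item it denotes, rather than from the vocabulary mean. The description table and helper function are assumptions made for illustration, not the paper's exact procedure.

```python
def grounded_init(description: str) -> torch.Tensor:
    """Mean of the pretrained subword embeddings of a textual description."""
    ids = tokenizer(description, add_special_tokens=False)["input_ids"]
    return emb[ids].mean(dim=0)

# Hypothetical item descriptions, one per Semantic-ID token.
item_descriptions = {
    "<sid_0>": "wireless noise-cancelling over-ear headphones",
    "<sid_1>": "stainless steel insulated water bottle",
}

with torch.no_grad():
    for tok, desc in item_descriptions.items():
        row = tokenizer.convert_tokens_to_ids(tok)
        emb[row] = grounded_init(desc)
```

Under this kind of initialization the new-token block inherits the spread of the pretrained embedding space, so the same centered-SVD diagnostic yields nonzero singular values: the tokens start out distinguishable rather than collapsed onto a single point.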