[2603.22629] LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation
Computer Science > Computation and Language
arXiv:2603.22629 (cs)
[Submitted on 23 Mar 2026]

Title: LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

Authors: Hailay Teklehaymanot, Dren Fazlija, Wolfgang Nejdl

Abstract: Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from th...
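
The initialization step described in the abstract (average morpheme vectors, fall back to character n-grams when no morphemes are found) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `segment` callback, the boundary markers, the n-gram range, and the toy English vectors are all hypothetical, and skipping morphemes that lack a pretrained vector is an assumption made here for simplicity.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def char_ngrams(token: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams of the token wrapped in '<' and '>' markers.

    The markers and the 2..3 range are illustrative choices, not values
    taken from the paper.
    """
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def init_embedding(
    token: str,
    segment: Callable[[str], List[str]],
    morpheme_vectors: Dict[str, Vector],
    ngram_vectors: Dict[str, Vector],
) -> Optional[Vector]:
    """Hypothetical initializer for one new vocabulary token.

    1. Segment the token into morphemes and average their known vectors.
    2. If no morpheme vector is available, average the vectors of the
       token's character n-grams instead.
    3. Return None if neither source yields a vector, leaving the caller
       to pick some other default.
    """
    morphemes = segment(token)
    known = [morpheme_vectors[m] for m in morphemes if m in morpheme_vectors]
    if known:
        return mean_vector(known)

    grams = [ngram_vectors[g] for g in char_ngrams(token) if g in ngram_vectors]
    if grams:
        return mean_vector(grams)
    return None


# Toy data, made up for illustration only.
toy_segmentations = {"unlock": ["un", "lock"]}
toy_morpheme_vectors = {"un": [1.0, 0.0], "lock": [0.0, 1.0]}
toy_ngram_vectors = {"<x": [2.0, 2.0], "xy": [4.0, 0.0]}


def toy_segment(token: str) -> List[str]:
    """Stand-in segmenter: returns [] when no segmentation is known."""
    return toy_segmentations.get(token, [])


morph_init = init_embedding(
    "unlock", toy_segment, toy_morpheme_vectors, toy_ngram_vectors
)
ngram_init = init_embedding(
    "xy", toy_segment, toy_morpheme_vectors, toy_ngram_vectors
)
```

In this toy run, `morph_init` is the average of the two morpheme vectors, while `ngram_init` is built only from the n-grams of `<xy>` that have a vector. A real setup would replace `toy_segment` with a morphological segmenter for the target language and take the vectors from the pretrained model or a FastText model.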