[2603.01691] Building a Strong Instruction Language Model for a Less-Resourced Language
Computer Science > Computation and Language

arXiv:2603.01691 (cs) [Submitted on 2 Mar 2026]

Title: Building a Strong Instruction Language Model for a Less-Resourced Language

Authors: Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec

Abstract: Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to Slovene using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Sloven...