[2512.03903] BERnaT: Basque Encoders for Representing Natural Textual Diversity
Computer Science > Computation and Language
arXiv:2512.03903 (cs)

[Submitted on 3 Dec 2025 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: BERnaT: Basque Encoders for Representing Natural Textual Diversity
Authors: Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa

Abstract: Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness, and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on the Basque language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora alone, improving performance ...
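The evaluation framework separates NLU benchmarks into standard and diverse subsets and scores them separately. As a minimal illustrative sketch (not the authors' code), the following Python snippet shows how such a split-wise macro average could be computed; every task name and accuracy in it is a hypothetical placeholder, and only the standard library is used:

    # Sketch of the standard-vs-diverse evaluation split described in the
    # abstract. All task names and accuracies are invented placeholders.
    from statistics import mean

    scores = {  # hypothetical per-task accuracies for one model
        "topic_classification": 0.82,  # standard written Basque
        "nli": 0.78,                   # standard written Basque
        "dialect_sentiment": 0.66,     # dialectal / social-media text
        "historical_ner": 0.59,        # historical orthography
    }

    subsets = {  # hypothetical task-to-subset assignment
        "standard": ["topic_classification", "nli"],
        "diverse": ["dialect_sentiment", "historical_ner"],
    }

    # Macro-average within each subset so that generalization to diverse
    # varieties is reported separately from standard performance.
    for name, tasks in subsets.items():
        print(f"{name}: {mean(scores[t] for t in tasks):.3f}")

Reporting the two subset averages side by side, rather than a single pooled score, is what lets the comparison between the standard, diverse, and combined pre-training configurations surface differences in linguistic generalization.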