[2603.19427] Vocabulary shapes cross-lingual variation of word-order learnability in language models
Computer Science > Computation and Language
arXiv:2603.19427 (cs) [Submitted on 19 Mar 2026]

Title: Vocabulary shapes cross-lingual variation of word-order learnability in language models
Authors: Jonas Mayer Martins, Jaap Jumelet, Viola Priesemann, Lisa Beinborn

Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2603.19427 [cs.CL] (or arXiv:2603.19427v1 [cs.CL] for this version)
https://doi.org/10.48550/...
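The abstract uses model surprisal as its measure of learnability: the less probable a model finds each token, the higher the surprisal and the harder the language variant is to learn. As a minimal illustration (not the authors' code; the probability values below are invented for the example), surprisal is the negative log-probability a model assigns to a token:

```python
import math

def surprisal_bits(prob: float) -> float:
    """Surprisal of an event in bits: -log2(p)."""
    return -math.log2(prob)

# Hypothetical next-token probabilities a language model might assign
# to the same token under a regular vs. a shuffled word order.
p_regular = 0.25   # regular word order: token is easier to predict
p_shuffled = 0.05  # irregular word order: token is harder to predict

print(surprisal_bits(p_regular))   # 2.0 bits
print(surprisal_bits(p_shuffled))  # ~4.32 bits
```

Higher average surprisal over a corpus indicates reduced learnability, which is how the paper quantifies the effect of word-order irregularity.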