[2604.04204] Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
Computer Science > Computation and Language
arXiv:2604.04204 (cs)
[Submitted on 5 Apr 2026]
Title: Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
Authors: Mir Tafseer Nayeem, Davood Rafiei
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably "English (US)," despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our ...
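As a rough illustration of what a corpus audit of the kind named in stage (i) might look like, the sketch below counts how often each side of an AmE/BrE variant pair occurs in a text and reports the AmE share. It is not the paper's DiAlign method, which the abstract does not specify in detail. The four-pair list, the function names `variant_counts` and `ame_share`, and the simple lowercase word tokenization are all assumptions made for this example; the paper's curated list of 1,813 variants is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical AmE -> BrE variant pairs, chosen only for illustration.
# The paper's curated 1,813-variant corpus is not reproduced here.
VARIANT_PAIRS = {
    "color": "colour",
    "center": "centre",
    "organize": "organise",
    "analyze": "analyse",
}


def variant_counts(text, pairs=VARIANT_PAIRS):
    """Return (AmE count, BrE count) over the variant pairs found in text."""
    # Naive tokenization: lowercase alphabetic runs only (an assumption).
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    ame = sum(tokens[a] for a in pairs)
    bre = sum(tokens[b] for b in pairs.values())
    return ame, bre


def ame_share(text, pairs=VARIANT_PAIRS):
    """Fraction of variant occurrences that are AmE, or None if none occur."""
    ame, bre = variant_counts(text, pairs)
    total = ame + bre
    return ame / total if total else None


sample = "The color of the centre changed; we analyze color and organise data."
print(variant_counts(sample))  # (3, 2)
print(ame_share(sample))       # 0.6
```

A share well above 0.5 on a large corpus would suggest a skew toward AmE spellings under these assumptions. A real audit would also need to handle inflected forms, multiword variants, and words whose spelling differs for reasons unrelated to dialect.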