[2512.08777] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
Computer Science > Computation and Language
arXiv:2512.08777 (cs)
[Submitted on 9 Dec 2025 (v1), last revised 27 Mar 2026 (this version, v2)]

Title: Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
Authors: David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov

Abstract: We propose a post-training method for lower-resource languages that preserves the fluency of language models even when they are aligned by disfluent reward models. Preference optimization is by now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both preference datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial: our method outperforms both alternatives without relying on any hard-to-obtain data.

Subjects: Computation and Language (cs.CL)
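
The abstract only sketches the approach, but the core idea lends itself to a short illustration: in on-policy preference optimization, the policy generates its own candidates and the reward model merely ranks them, so the judge's disfluency never leaks into the gradient, while a KL-style anchor to the base model helps preserve the policy's native fluency. Below is a minimal, hypothetical sketch of one such step in the style of online DPO. The model names, the `judge_score` callable, and the loss form are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: one on-policy preference-optimization step where a
# (possibly disfluent) judge only ranks on-policy samples. Not the paper's
# exact method; model names and the judge are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_NAME = "my-org/norwegian-base-lm"          # assumption: any causal LM
# `judge_score` below is assumed to be any text -> float scorer (reward model).

tokenizer = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME)
reference = AutoModelForCausalLM.from_pretrained(POLICY_NAME)  # frozen copy
reference.requires_grad_(False)

def completion_logprob(model, prompt_ids, full_ids):
    """Sum of token log-probs of the completion (tokens after the prompt)."""
    logits = model(full_ids).logits[:, :-1]
    logps = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_ids.shape[1] - 1:].sum(-1)

def online_dpo_step(prompt, judge_score, beta=0.1, num_samples=4):
    """Sample on-policy, let the judge pick best/worst, apply a DPO-style update."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        samples = [
            policy.generate(prompt_ids, do_sample=True, max_new_tokens=128)
            for _ in range(num_samples)
        ]
    # The judge only *ranks* candidates; its own (dis)fluency never enters
    # the gradient, which is the intuition behind aligning a fluent policy
    # with a disfluent reward model.
    scores = [judge_score(tokenizer.decode(s[0])) for s in samples]
    chosen = samples[max(range(num_samples), key=lambda i: scores[i])]
    rejected = samples[min(range(num_samples), key=lambda i: scores[i])]

    pi_w = completion_logprob(policy, prompt_ids, chosen)
    pi_l = completion_logprob(policy, prompt_ids, rejected)
    with torch.no_grad():
        ref_w = completion_logprob(reference, prompt_ids, chosen)
        ref_l = completion_logprob(reference, prompt_ids, rejected)

    # DPO-style loss: the frozen reference anchors the policy, so fluency
    # learned in pre-training is preserved while preferences shift.
    loss = -F.logsigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).mean()
    loss.backward()
    return loss.item()
```

Contrast this with the two baselines named in the abstract: supervised finetuning on machine-translated data trains directly on (often disfluent) translated text, and multilingual finetuning trains on other languages' data, whereas the on-policy loop only ever reinforces text the model itself produced in the target language.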