[2509.26601] MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Computer Science > Computation and Language
arXiv:2509.26601 (cs)
[Submitted on 30 Sep 2025 (v1), last revised 28 Feb 2026 (this version, v3)]

Title: MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Authors: Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán

Abstract: Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enha...