[2509.26601] MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

[2509.26601] MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

arXiv - Machine Learning 4 min read

About this article

Abstract page for arXiv paper 2509.26601: MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Computer Science > Computation and Language arXiv:2509.26601 (cs) [Submitted on 30 Sep 2025 (v1), last revised 28 Feb 2026 (this version, v3)] Title:MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages Authors:Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán View a PDF of the paper titled MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages, by Chenxi Whitehouse and 8 other authors View PDF HTML (experimental) Abstract:Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enha...

Originally published on March 03, 2026. Curated by AI News.

Related Articles

Llms

CLI for Google AI Search (gai.google) — run AI-powered code/tech searches headlessly from your terminal

Google AI (gai.google) gives Gemini-powered answers for technical queries — think AI-enhanced search with code understanding. I built a C...

Reddit - Artificial Intelligence · 1 min ·
Llms

Why are we blindly trusting AI companies with our data?

Lately I’ve been seeing a story floating around that really made me pause. Apparently, there were claims that the US government asked Ant...

Reddit - Artificial Intelligence · 1 min ·
De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV
Llms

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV

Artificial intelligence is transforming every corner of industry, and television is no exception. Major networks in Korea have recently a...

AI Tools & Products · 4 min ·
[2603.16629] MLLM-based Textual Explanations for Face Comparison
Llms

[2603.16629] MLLM-based Textual Explanations for Face Comparison

Abstract page for arXiv paper 2603.16629: MLLM-based Textual Explanations for Face Comparison

arXiv - AI · 4 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime