[2509.22957] Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas
Computer Science > Machine Learning

arXiv:2509.22957 (cs)

[Submitted on 26 Sep 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

Authors: Luke Guerdan, Justin Whitehouse, Kimberly Truong, Kenneth Holstein, Zhiwei Steven Wu

Abstract: As Generative AI (GenAI) systems see growing adoption, a key concern is the external validity of evaluations: the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system qual...
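The doubly-robust idea described in the abstract can be illustrated with a minimal AIPW-style estimator on synthetic data. Everything below (the rater covariate, the sampling-bias mechanism, the persona model, and the known density ratio) is a hypothetical construction for illustration, not the paper's estimator: persona ratings act as an imperfect outcome model evaluated on the target sample, and importance-weighted human-rating residuals from the biased source sample correct the persona's systematic bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (illustrative, not from the paper): human ratings depend
# on a rater covariate x as y = 2x + noise, so the target mean rating under a
# Uniform(0,1) deployment population is 1.0.
n_target, n_source = 5000, 500
x_target = rng.uniform(0.0, 1.0, n_target)   # target (deployment) sample
x_source = rng.beta(2, 3, n_source)          # biased source sample (skews low)
y_source = 2.0 * x_source + rng.normal(0.0, 0.1, n_source)  # human ratings

# Imperfect "persona" ratings: an LLM-judge proxy with a systematic +0.3 bias.
def persona(x):
    return 2.0 * x + 0.3

# Density ratio p_target(x) / p_source(x); Beta(2,3) has pdf 12*x*(1-x)^2.
# Here the ratio is known; in practice it is estimated, and double robustness
# protects against misspecifying either it or the persona model (not both).
w = 1.0 / (12.0 * x_source * (1.0 - x_source) ** 2)

# AIPW-style doubly-robust estimate of the target mean rating: the persona
# plug-in on the target sample, plus a weighted residual correction computed
# from the human-rated source sample.
dr_estimate = persona(x_target).mean() + np.average(y_source - persona(x_source), weights=w)

naive_estimate = y_source.mean()          # ignores sampling bias (about 0.8 here)
persona_only = persona(x_target).mean()   # ignores persona bias (about 1.3 here)
print(f"DR: {dr_estimate:.3f}  naive: {naive_estimate:.3f}  persona-only: {persona_only:.3f}")
```

The residual correction is what makes the estimate doubly robust: it is consistent if either the importance weights or the persona model is correct. In this sketch the weights are exact, so the persona's +0.3 bias cancels and the DR estimate lands near the true value of 1.0, while the naive and persona-only baselines remain biased.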