[2603.00039] CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Computer Science > Machine Learning
arXiv:2603.00039 (cs)
[Submitted on 9 Feb 2026]

Title: CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Authors: Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala

Abstract: LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks sp...
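The abstract's core claim -- that averaging correlated judges inherits their shared bias, while modeling the confounder explicitly can cancel it -- can be illustrated with a minimal synthetic sketch. This is not the paper's implementation; it assumes a toy generative model (score = quality + loading × confounder + noise) with known loadings, and contrasts a plain average with a linear aggregation whose weights are constrained to cancel the confounder:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 5                                # items, judges
q = rng.normal(size=n)                        # latent true quality
c = rng.normal(size=n)                        # shared confounder (e.g. verbosity)
lam = np.array([0.9, 0.8, 0.7, 0.6, 0.5])     # hypothetical judge-specific loadings
noise = 0.1 * rng.normal(size=(n, m))
S = q[:, None] + c[:, None] * lam[None, :] + noise   # observed judge scores

# Naive aggregation: the plain average inherits the shared confounder,
# since every judge's error has the same sign on a given item.
naive = S.mean(axis=1)

# Confounder-aware aggregation (illustrative): pick weights w with
#   w @ ones = 1   (unbiased for the quality signal q)
#   w @ lam  = 0   (cancels the confounder component c)
# lstsq on the underdetermined system returns the minimum-norm such w.
A = np.vstack([np.ones(m), lam])
b = np.array([1.0, 0.0])
w, *_ = np.linalg.lstsq(A, b, rcond=None)
deconfounded = S @ w

print("corr(naive, q):        ", np.corrcoef(naive, q)[0, 1])
print("corr(deconfounded, q): ", np.corrcoef(deconfounded, q)[0, 1])
```

In this toy setup the deconfounded estimate tracks true quality more closely than the naive average. The actual CARE framework differs in that it must recover the confounder structure from the scores alone, without ground-truth labels or known loadings.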