[2603.00039] CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

[2603.00039] CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

arXiv - Machine Learning 3 min read

About this article

Abstract page for arXiv paper 2603.00039: CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Computer Science > Machine Learning arXiv:2603.00039 (cs) [Submitted on 9 Feb 2026] Title:CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation Authors:Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala View a PDF of the paper titled CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation, by Jitian Zhao and 4 other authors View PDF HTML (experimental) Abstract:LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks sp...

Originally published on March 03, 2026. Curated by AI News.

Related Articles

Popular AI gateway startup LiteLLM ditches controversial startup Delve | TechCrunch
Llms

Popular AI gateway startup LiteLLM ditches controversial startup Delve | TechCrunch

LiteLLM had obtained two security compliance certifications via Delve and fell victim to some horrific credential-stealing malware last w...

TechCrunch - AI · 3 min ·
Llms

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Greetings all - I've posted mostly in r/claudecode and r/aigamedev a couple of times previously. Working with CC for personal projects re...

Reddit - Artificial Intelligence · 1 min ·
Llms

World models will be the next big thing, bye-bye LLMs

Was at Nvidia's GTC conference recently and honestly, it was one of the most eye-opening events I've attended in a while. There was a lot...

Reddit - Artificial Intelligence · 1 min ·
Llms

we open sourced a tool that auto generates your AI agent context from your actual codebase, just hit 250 stars

hey everyone. been lurking here for a while and wanted to share something we been building. the problem: ai coding agents are only as goo...

Reddit - Artificial Intelligence · 1 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime