[2603.03319] Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
Computer Science > Computation and Language
arXiv:2603.03319 (cs)
[Submitted on 9 Feb 2026]
Title: Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
Authors: James Wedgwood, Chhavi Yadav, Virginia Smith
Abstract: Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across ...