[2603.20562] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Computer Science > Computation and Language
arXiv:2603.20562 (cs)
[Submitted on 20 Mar 2026]

Title: Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Authors: Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan

Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.

Subjects: Computation and Language (cs.CL)
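The core idea described in the abstract, rerunning the same listwise prompt over several candidate orderings and aggregating the per-candidate scores back into a single consensus decision, can be sketched as below. This is a minimal illustration, not the paper's implementation: `toy_judge` is a hypothetical stand-in for an order-sensitive LLM judge (it adds an explicit bonus to whichever candidate appears first), and the aggregation here is a plain mean over orderings rather than PCFJudge's full score/rank/uncertainty combination.

```python
import itertools
import statistics


def permutation_consensus(candidates, judge, max_orders=None):
    """Aggregate listwise judge scores over multiple orderings.

    `judge(ordered_candidates)` must return one score per position.
    Scores are mapped back to each candidate's original identity and
    averaged across orderings, so position-dependent bias cancels out.
    """
    n = len(candidates)
    orders = list(itertools.permutations(range(n)))
    if max_orders is not None:
        orders = orders[:max_orders]

    per_candidate = {i: [] for i in range(n)}
    for order in orders:
        scores = judge([candidates[i] for i in order])
        for pos, idx in enumerate(order):
            per_candidate[idx].append(scores[pos])

    means = {candidates[i]: statistics.mean(v) for i, v in per_candidate.items()}
    best = max(means, key=means.get)
    return best, means


# Hypothetical order-biased judge: true quality plus a first-position bonus.
def toy_judge(ordered):
    true_quality = {"A": 0.6, "B": 0.9, "C": 0.5}
    return [true_quality[c] + (0.4 if pos == 0 else 0.0)
            for pos, c in enumerate(ordered)]


# A single pass over ["A", "B", "C"] would rank A first (0.6 + 0.4 > 0.9),
# but averaging over all orderings recovers the truly best candidate, B.
best, means = permutation_consensus(["A", "B", "C"], toy_judge)
print(best)
```

Because every candidate occupies the first position in the same fraction of permutations, the bias term contributes equally to each candidate's mean and no longer affects the argmax, which is the intuition behind treating order as nuisance variation to average out.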