[2603.22214] Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
Computer Science > Cryptography and Security
arXiv:2603.22214 (cs)
[Submitted on 23 Mar 2026]
Title: Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
Authors: Tom Biskupski, Stephan Kleber
Abstract: A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation scales up the complex evaluation of the victim models' free-form text outputs by providing faster and more consistent judgments than human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparatively new technique, LLMs as judges lack a thorough investigation of their reliability and agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As the assessment objective, we curate datasets for eight differe...
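As a rough illustration (not code from the paper), the minimal Python sketch below shows the LLM-as-judge setup the abstract describes: one judge model paired with one engineered judge prompt that encodes the evaluation criteria and scores a victim model's free-form output. The model name, the criteria, the 1-5 scale, and the call_llm() helper are hypothetical stand-ins for whatever provider API is actually used.

# Minimal LLM-as-judge sketch: one judge model plus one engineered judge
# prompt containing the evaluation criteria. All names and the scoring
# scale are illustrative assumptions, not taken from the paper.

JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the QUESTION
on a scale from 1 (poor) to 5 (excellent) against these criteria:
- factual correctness
- completeness
- clarity
Return only the integer score.

QUESTION: {question}
RESPONSE: {response}"""


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a provider-specific chat-completion call.

    Replace the body with a real API call; the canned reply keeps the
    sketch runnable end to end.
    """
    return "4"


def judge(question: str, victim_output: str, judge_model: str = "judge-model") -> int:
    """Score one free-form output of a victim LLM with the judge model."""
    prompt = JUDGE_PROMPT.format(question=question, response=victim_output)
    return int(call_llm(judge_model, prompt).strip())


if __name__ == "__main__":
    print(judge("What is the capital of France?",
                "Paris is the capital of France."))

A second-level judge, as mentioned in the abstract, could be sketched the same way by passing the first judge's verdict (together with the original question and response) to another judge prompt that reviews it.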