[2603.22214] Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
Computer Science > Cryptography and Security
arXiv:2603.22214 (cs)
[Submitted on 23 Mar 2026]
Title: Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
Authors: Tom Biskupski, Stephan Kleber
Abstract: A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation scales up the complex evaluation of the victim models' free-form text outputs by providing faster and more consistent judgments than human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparatively new technique, LLMs as judges lack a thorough investigation of their reliability and agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As the assessment objective, we curate datasets for eight differe...
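As a rough illustration (not code from the paper), the minimal Python sketch below shows the LLM-as-judge setup the abstract describes: one judge model paired with one engineered judge prompt that encodes the evaluation criteria and scores a victim model's free-form output. The model name, the criteria, the 1-5 scale, and the call_llm() helper are hypothetical stand-ins for whatever provider API is actually used.

# Minimal LLM-as-judge sketch: one judge model plus one engineered judge
# prompt containing the evaluation criteria. All names and the scoring
# scale are illustrative assumptions, not taken from the paper.

JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the QUESTION
on a scale from 1 (poor) to 5 (excellent) against these criteria:
- factual correctness
- completeness
- clarity
Return only the integer score.

QUESTION: {question}
RESPONSE: {response}"""


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a provider-specific chat-completion call.

    Replace the body with a real API call; the canned reply keeps the
    sketch runnable end to end.
    """
    return "4"


def judge(question: str, victim_output: str, judge_model: str = "judge-model") -> int:
    """Score one free-form output of a victim LLM with the judge model."""
    prompt = JUDGE_PROMPT.format(question=question, response=victim_output)
    return int(call_llm(judge_model, prompt).strip())


if __name__ == "__main__":
    print(judge("What is the capital of France?",
                "Paris is the capital of France."))

A second-level judge, as mentioned in the abstract, could be sketched the same way by passing the first judge's verdict (together with the original question and response) to another judge prompt that reviews it.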