[2510.12462] Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Summary
This article evaluates biases in Large Language Models (LLMs) used as judges in communication systems, assessing their reliability and proposing mitigation strategies.
Why It Matters
As LLMs increasingly evaluate content in communication systems, understanding and mitigating biases is crucial to ensure fair outcomes and maintain user trust. This research highlights the risks associated with biased evaluations and offers strategies to enhance the integrity of AI judgments.
Key Takeaways
- LLMs can exhibit biases in evaluating content, impacting trust.
- State-of-the-art LLM judges generally score biased inputs lower.
- Fine-tuning on biased data can degrade LLM performance.
- Task difficulty significantly influences the scores judges assign.
- Four mitigation strategies are proposed to enhance fairness.
Computer Science > Artificial Intelligence — arXiv:2510.12462 (cs)
[Submitted on 14 Oct 2025 (v1), last revised 24 Feb 2026 (this version, v2)]
Authors: Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang
Abstract: Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses in telecom customer support chatbots. However, the impartiality of these AI "judges" is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases in two LLM-as-a-judge models (i.e., GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11 types of biases that cover both implicit and explicit forms. We observed that state-of-the-art LLM judges demonstrate robustness to biased inputs, generally assigning them lower scores than the corresponding clean samples. Providing a detailed scoring rubric further enhances this robustness. We further found that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also discovered that the judge...
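To make the point-wise scoring setting concrete, here is a minimal sketch of how such a judge pipeline is commonly wired up: a prompt that asks for a single numeric score (optionally with a detailed rubric, which the abstract notes improves robustness) and a parser that extracts that score. The function names, rubric text, and output format below are illustrative assumptions, not the paper's actual prompts or the GPT-Judge/JudgeLM implementations.

```python
import re
from typing import Optional

# Illustrative rubric; the paper only states that a detailed rubric
# improves robustness, not what its rubric contains.
RUBRIC = """Score the response from 1 to 10 based on:
- Relevance to the customer's question
- Factual accuracy
- Clarity and tone
Penalize unsupported claims or biased framing."""

def build_judge_prompt(question: str, response: str,
                       rubric: Optional[str] = None) -> str:
    """Assemble a point-wise judging prompt; the rubric is optional."""
    parts = ["You are an impartial judge. Rate the response below."]
    if rubric:
        parts.append(rubric)
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    parts.append('Reply with "Score: <1-10>" and a brief justification.')
    return "\n\n".join(parts)

def parse_score(judge_output: str) -> Optional[int]:
    """Extract the numeric score from the judge's reply, if present."""
    m = re.search(r"Score:\s*(\d+)", judge_output)
    return int(m.group(1)) if m else None

# Example usage (the LLM call itself is omitted):
prompt = build_judge_prompt(
    "How do I reset my router?",
    "Hold the reset button for 10 seconds.",
    rubric=RUBRIC,
)
print(parse_score("Score: 8 - concise and accurate."))  # → 8
```

In this setup, bias evaluation reduces to comparing the parsed scores a judge assigns to clean samples versus their bias-injected counterparts.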