[2602.22585] Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
Summary
This paper explores the integration of psychometric rater models into AI evaluation, aiming to correct human label biases and improve the reliability of AI assessments.
Why It Matters
Human evaluations are crucial for training and assessing AI models, yet they often suffer from systematic rater errors such as severity and centrality. This research introduces a method to enhance the validity of these evaluations, which is essential for developing trustworthy AI systems. By addressing rater effects, the findings could lead to more accurate AI performance assessments and better decision-making in AI development.
Key Takeaways
- Human evaluations in AI are prone to systematic errors.
- Item response theory can correct biases in human ratings.
- Adjusting for rater severity leads to more reliable quality estimates (a model sketch follows this list).
- The approach enhances transparency in AI evaluation processes.
- Improved evaluation methods can inform better AI development decisions.
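A hedged sketch of the rater model behind these takeaways, written in one standard facets (many-facet Rasch) notation; the paper's exact parameterization may differ:

```latex
% Adjacent-category (rating-scale) form of a facets model:
% the log-odds that rater j gives summary n category k rather than k-1.
\[
  \ln\frac{P_{njk}}{P_{nj(k-1)}} \;=\; \theta_n \;-\; \alpha_j \;-\; \tau_k
\]
% \theta_n : latent quality of summary n
% \alpha_j : severity of rater j (higher = harsher)
% \tau_k   : threshold for moving up into rating category k
% Corrected quality estimates are the fitted \hat{\theta}_n, with each
% rater's severity \alpha_j separated out of the raw ratings.
```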
Computer Science > Artificial Intelligence
arXiv:2602.22585 (cs) [Submitted on 26 Feb 2026]
Title: Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
Authors: Jodi M. Casabianca, Maggie Beiting-Parrish
Abstract: Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects (severity and centrality) that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more ro...
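As a rough illustration of the severity adjustment the abstract describes, the sketch below simulates ratings of summaries by raters of differing severity, fits the facets rating-scale model above by joint maximum likelihood, and recovers severity-corrected quality estimates. It is a minimal sketch under assumed settings, not the authors' implementation or the OpenAI dataset; all names, counts, and simulated parameters are hypothetical.

```python
# Minimal sketch of a many-facet Rasch (rating-scale) fit on simulated ratings.
# Not the paper's code; counts, names, and simulated values are assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

n_summaries, n_raters, K = 30, 5, 4               # K+1 rating categories: 0..K
true_theta = rng.normal(0.0, 1.0, n_summaries)    # latent summary quality
true_alpha = rng.normal(0.0, 0.7, n_raters)       # rater severity
true_tau = np.array([-1.5, -0.5, 0.5, 1.5])       # category thresholds

def category_probs(theta, alpha, tau):
    """P(rating = k), k = 0..K, under the adjacent-category (rating-scale) model."""
    logits = np.concatenate([[0.0], np.cumsum(theta - alpha - tau)])
    logits -= logits.max()                        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Simulate one rating per (summary, rater) pair.
sid, rid, x = [], [], []
for n in range(n_summaries):
    for j in range(n_raters):
        p = category_probs(true_theta[n], true_alpha[j], true_tau)
        sid.append(n); rid.append(j); x.append(rng.choice(K + 1, p=p))
sid, rid, x = map(np.array, (sid, rid, x))

def unpack(params):
    theta = params[:n_summaries]
    alpha = params[n_summaries:n_summaries + n_raters]
    tau = params[n_summaries + n_raters:]
    # Identification: rater severities and thresholds are centered at zero.
    return theta, alpha - alpha.mean(), tau - tau.mean()

def neg_log_lik(params):
    theta, alpha, tau = unpack(params)
    nll = 0.0
    for n, j, k in zip(sid, rid, x):
        nll -= np.log(category_probs(theta[n], alpha[j], tau)[k])
    return nll

fit = minimize(neg_log_lik, np.zeros(n_summaries + n_raters + K), method="L-BFGS-B")
theta_hat, alpha_hat, tau_hat = unpack(fit.x)

# theta_hat are severity-corrected quality estimates; alpha_hat diagnoses raters.
print("estimated rater severities:", np.round(alpha_hat, 2))
print("corr(true, estimated quality):",
      round(float(np.corrcoef(true_theta, theta_hat)[0, 1]), 2))
```

In practice one would use an established IRT/facets package and examine rater fit statistics as well; the point of the sketch is only that the fitted severities absorb each rater's harshness or leniency, so the quality estimates reflect the summaries rather than which raters happened to score them.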