[2603.24999] Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Statistics > Applications
arXiv:2603.24999 (stat)
[Submitted on 26 Mar 2026]

Title: Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Authors: Michael Hardy, Joshua Gilbert, Benjamin Domingue

Abstract: The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $\tau$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the signed isotonic $R^2$ is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and show that this optimality property translates directly into practical screening...
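
To make the verbal definition in the abstract concrete, here is a minimal Python sketch of one way a signed isotonic $R^2$ and an item-level aggregate could be computed. The abstract does not give the paper's exact formulas, so everything here is an assumption for illustration: the function names (`pava_increasing`, `isotonic_r2`, `kendall_sign`, `signed_isotonic_r2`, `item_scores`) are hypothetical, the magnitude is taken as the larger of the increasing and decreasing isotonic fits, the sign is the sign of Kendall's $\tau$ numerator, and the item score is a plain mean over the other items.

```python
from itertools import groupby


def pava_increasing(values, weights):
    """Weighted pool-adjacent-violators: best nondecreasing least-squares fit."""
    blocks = []  # each block holds [mean, weight, number of pooled points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Pool backwards while two neighbouring blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


def isotonic_r2(x, y, increasing=True):
    """R^2 of the best monotone-in-x least-squares fit to y."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    if ss_tot == 0:
        return 0.0  # a constant item has no variance to explain
    # Collapse tied x values into one weighted point per distinct level.
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    levels = [[p[1] for p in grp] for _, grp in groupby(pairs, key=lambda p: p[0])]
    means = [sum(ys) / len(ys) for ys in levels]
    weights = [len(ys) for ys in levels]
    if increasing:
        fit = pava_increasing(means, weights)
    else:
        # A decreasing fit to y is the negated increasing fit to -y.
        fit = [-f for f in pava_increasing([-m for m in means], weights)]
    ss_res = sum((v - f) ** 2 for ys, f in zip(levels, fit) for v in ys)
    return 1.0 - ss_res / ss_tot


def kendall_sign(x, y):
    """Sign of (concordant - discordant) pairs, i.e. the sign of Kendall's tau."""
    s = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s += ((x[i] > x[j]) - (x[i] < x[j])) * ((y[i] > y[j]) - (y[i] < y[j]))
    return (s > 0) - (s < 0)


def signed_isotonic_r2(x, y):
    """ASSUMED definition: sign(tau) times the best monotone R^2 of y given x."""
    best = max(isotonic_r2(x, y, True), isotonic_r2(x, y, False))
    return kendall_sign(x, y) * best


def item_scores(responses):
    """ASSUMED aggregation: mean signed R^2 of each item predicted from the others.

    `responses` is a list of rows (respondents or models) by columns (items).
    """
    items = list(zip(*responses))
    scores = []
    for j, target in enumerate(items):
        others = [signed_isotonic_r2(src, target)
                  for k, src in enumerate(items) if k != j]
        scores.append(sum(others) / len(others))
    return scores


# Toy data: items 0-2 agree with each other, item 3 is scored in reverse,
# as a miskeyed item would be.
toy = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
toy_scores = item_scores(toy)
```

In this toy matrix the reversed item receives the lowest (negative) score, which is the kind of separation the abstract describes. The pairwise loop in `kendall_sign` is quadratic in the number of respondents, so a real screening pass over thousands of items would need a faster implementation.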