[2603.00077] Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Computer Science > Computation and Language
arXiv:2603.00077 (cs)
[Submitted on 13 Feb 2026]

Title: Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Authors: Delip Rao, Chris Callison-Burch

Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $\kappa$, weighted $\kappa$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and c...
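The four multi-judge aggregation modes named in the abstract (majority, weighted, unanimous, any-vote) can be sketched over binary per-judge verdicts. This is an illustrative helper, not Autorubric's actual API; the function name, `mode` strings, and `weights` parameter are assumptions.

```python
from collections import Counter

def aggregate(verdicts, mode="majority", weights=None):
    """Combine per-judge binary verdicts (True/False) into one decision.

    Hypothetical sketch of the aggregation modes named in the abstract.
    """
    if mode == "majority":
        # Most common verdict wins (ties resolve by first occurrence).
        return Counter(verdicts).most_common(1)[0][0]
    if mode == "weighted":
        # Pass if the total weight of True verdicts exceeds half the mass.
        weights = weights or [1.0] * len(verdicts)
        true_mass = sum(w for v, w in zip(verdicts, weights) if v)
        return true_mass > sum(weights) / 2
    if mode == "unanimous":
        return all(verdicts)
    if mode == "any":
        return any(verdicts)
    raise ValueError(f"unknown mode: {mode}")
```

With three judges voting (True, True, False), majority passes; with weights (3.0, 1.0, 1.0) favoring the first judge, a lone True from that judge also carries the weighted vote.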
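Among the psychometric reliability metrics listed, Cohen's $\kappa$ measures agreement between two raters beyond chance. A minimal sketch of the plain (unweighted) statistic, assuming two equal-length label lists; Autorubric's own interface may differ:

```python
def cohens_kappa(a, b):
    """Plain Cohen's kappa for two raters' label sequences (illustrative).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

For example, raters labeling ["yes", "yes", "no", "no"] and ["yes", "no", "no", "no"] agree on 3 of 4 items (p_o = 0.75) with chance agreement p_e = 0.5, giving kappa = 0.5.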
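The position-bias mitigation (option shuffling) can be sketched as randomizing the order of options shown to the judge while keeping a map back to the original indices. The helper below is a hypothetical illustration under that assumption, not the framework's actual implementation:

```python
import random

def shuffled_options(options, seed=None):
    """Shuffle options for presentation to a judge, returning the shuffled
    list plus the permutation so a pick can be mapped back.

    If the judge picks shuffled index j, the original index is order[j].
    Illustrative sketch of the position-bias mitigation in the abstract.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order
```

Re-running the evaluation with different seeds presents each option in varying positions, so a judge's preference for, say, the first-listed option averages out across runs.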