[2603.00077] Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Computer Science > Computation and Language
arXiv:2603.00077 (cs)
[Submitted on 13 Feb 2026]

Title: Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Authors: Delip Rao, Chris Callison-Burch

Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $\kappa$, weighted $\kappa$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and c...
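The four multi-judge aggregation modes named in the abstract (majority, weighted, unanimous, any-vote) can be sketched over binary per-judge verdicts. This is an illustrative helper, not Autorubric's actual API; the function name, `mode` strings, and `weights` parameter are assumptions.

```python
from collections import Counter

def aggregate(verdicts, mode="majority", weights=None):
    """Combine per-judge binary verdicts (True/False) into one decision.

    Hypothetical sketch of the aggregation modes named in the abstract.
    """
    if mode == "majority":
        # Most common verdict wins (ties resolve by first occurrence).
        return Counter(verdicts).most_common(1)[0][0]
    if mode == "weighted":
        # Pass if the total weight of True verdicts exceeds half the mass.
        weights = weights or [1.0] * len(verdicts)
        true_mass = sum(w for v, w in zip(verdicts, weights) if v)
        return true_mass > sum(weights) / 2
    if mode == "unanimous":
        return all(verdicts)
    if mode == "any":
        return any(verdicts)
    raise ValueError(f"unknown mode: {mode}")
```

With three judges voting (True, True, False), majority passes; with weights (3.0, 1.0, 1.0) favoring the first judge, a lone True from that judge also carries the weighted vote.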
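Among the psychometric reliability metrics listed, Cohen's $\kappa$ measures agreement between two raters beyond chance. A minimal sketch of the plain (unweighted) statistic, assuming two equal-length label lists; Autorubric's own interface may differ:

```python
def cohens_kappa(a, b):
    """Plain Cohen's kappa for two raters' label sequences (illustrative).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

For example, raters labeling ["yes", "yes", "no", "no"] and ["yes", "no", "no", "no"] agree on 3 of 4 items (p_o = 0.75) with chance agreement p_e = 0.5, giving kappa = 0.5.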
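The position-bias mitigation (option shuffling) can be sketched as randomizing the order of options shown to the judge while keeping a map back to the original indices. The helper below is a hypothetical illustration under that assumption, not the framework's actual implementation:

```python
import random

def shuffled_options(options, seed=None):
    """Shuffle options for presentation to a judge, returning the shuffled
    list plus the permutation so a pick can be mapped back.

    If the judge picks shuffled index j, the original index is order[j].
    Illustrative sketch of the position-bias mitigation in the abstract.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order
```

Re-running the evaluation with different seeds presents each option in varying positions, so a judge's preference for, say, the first-listed option averages out across runs.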