[2602.13576] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges
Summary
The paper identifies a vulnerability in LLM-as-judge evaluation pipelines, termed Rubric-Induced Preference Drift (RIPD): rubric edits that pass benchmark validation can still produce systematic, directional shifts in a judge's preferences, leading to biased judgments and reduced accuracy on target domains.
Why It Matters
As LLM judges are increasingly used to evaluate and align other models, understanding the risks introduced by their rubrics is crucial. This research highlights a vulnerability that can silently degrade judge accuracy and reliability, raising concerns about the integrity of evaluation and alignment pipelines built on top of them.
Key Takeaways
- Rubric-Induced Preference Drift (RIPD) can occur even with validated rubric edits.
- RIPD can lead to systematic biases in LLM judgments, affecting downstream applications.
- The study reveals how seemingly benign, criterion-preserving rubric edits can reduce target-domain judge accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness).
- Rubrics serve as a manipulable control interface, posing alignment risks.
- Understanding these vulnerabilities is essential for improving AI evaluation methods.
Paper Details
arXiv:2602.13576 (cs, Cryptography and Security). Submitted on 14 Feb 2026.
Authors: Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng
Abstract: Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preferenc...
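The core measurement the abstract describes can be sketched concretely: score a judge's agreement with a fixed trusted reference under the original rubric and under an edited rubric, and read RIPD off the difference. The sketch below is illustrative only, not the paper's code; the `judge` function is a hypothetical toy stand-in for an LLM judge call, built so that a single criterion-preserving-looking edit ("concise") flips its preferences on a target domain.

```python
# Toy sketch of measuring Rubric-Induced Preference Drift (RIPD):
# agreement with a trusted reference under two rubric variants.
# `judge` is a hypothetical stand-in for an LLM judge, not a real API.

def judge(rubric: str, pair: tuple) -> int:
    """Toy judge: prefers the longer response (index 0 vs 1) unless the
    rubric mentions conciseness, mimicking how a small rubric edit can
    directionally flip preferences on some inputs."""
    a, b = pair
    if "concise" in rubric:              # the 'edited' criterion
        return 0 if len(a) <= len(b) else 1
    return 0 if len(a) >= len(b) else 1

def agreement(rubric, pairs, reference):
    """Fraction of pairs where the judge matches the trusted reference."""
    hits = sum(judge(rubric, p) == r for p, r in zip(pairs, reference))
    return hits / len(pairs)

# Target-domain response pairs and a fixed trusted-reference labeling
# (0 = first response preferred, 1 = second preferred).
pairs = [("short", "a much longer answer"),
         ("detailed reply", "ok"),
         ("thorough response here", "brief"),
         ("hi", "elaborate answer")]
reference = [1, 0, 0, 1]  # trusted raters prefer the longer answer here

base   = agreement("Reward helpful answers.", pairs, reference)
edited = agreement("Reward helpful, concise answers.", pairs, reference)
drift  = base - edited    # positive drift = accuracy lost on target domain
print(base, edited, drift)  # → 1.0 0.0 1.0
```

The toy exaggerates the effect for clarity; the paper's point is that real drift is far subtler, shifting only a slice of target-domain judgments while aggregate benchmark metrics stay unchanged, which is what makes it hard to catch by spot-checking.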