[2602.13576] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges
Summary
The paper identifies a vulnerability in LLM-as-judge evaluation pipelines, termed Rubric-Induced Preference Drift (RIPD): rubric edits that pass benchmark validation can still produce systematic, directional shifts in a judge's preferences, leading to biased judgments and reduced accuracy on target domains.
Why It Matters
As LLM judges are increasingly used to evaluate and align other models, understanding the risks introduced by their rubrics is crucial. This research highlights a vulnerability that can silently degrade judge accuracy and reliability, raising concerns about the integrity of evaluation and alignment pipelines built on top of them.
Key Takeaways
- Rubric-Induced Preference Drift (RIPD) can occur even with validated rubric edits.
- RIPD can lead to systematic biases in LLM judgments, affecting downstream applications.
- The study reveals how seemingly benign, criterion-preserving rubric edits can reduce target-domain judge accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness).
- Rubrics serve as a manipulable control interface, posing alignment risks.
- Understanding these vulnerabilities is essential for improving AI evaluation methods.
Paper Details
arXiv:2602.13576 (cs, Cryptography and Security). Submitted on 14 Feb 2026.
Authors: Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng
Abstract: Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preferenc...
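The core measurement the abstract describes can be sketched concretely: score a judge's agreement with a fixed trusted reference under the original rubric and under an edited rubric, and read RIPD off the difference. The sketch below is illustrative only, not the paper's code; the `judge` function is a hypothetical toy stand-in for an LLM judge call, built so that a single criterion-preserving-looking edit ("concise") flips its preferences on a target domain.

```python
# Toy sketch of measuring Rubric-Induced Preference Drift (RIPD):
# agreement with a trusted reference under two rubric variants.
# `judge` is a hypothetical stand-in for an LLM judge, not a real API.

def judge(rubric: str, pair: tuple) -> int:
    """Toy judge: prefers the longer response (index 0 vs 1) unless the
    rubric mentions conciseness, mimicking how a small rubric edit can
    directionally flip preferences on some inputs."""
    a, b = pair
    if "concise" in rubric:              # the 'edited' criterion
        return 0 if len(a) <= len(b) else 1
    return 0 if len(a) >= len(b) else 1

def agreement(rubric, pairs, reference):
    """Fraction of pairs where the judge matches the trusted reference."""
    hits = sum(judge(rubric, p) == r for p, r in zip(pairs, reference))
    return hits / len(pairs)

# Target-domain response pairs and a fixed trusted-reference labeling
# (0 = first response preferred, 1 = second preferred).
pairs = [("short", "a much longer answer"),
         ("detailed reply", "ok"),
         ("thorough response here", "brief"),
         ("hi", "elaborate answer")]
reference = [1, 0, 0, 1]  # trusted raters prefer the longer answer here

base   = agreement("Reward helpful answers.", pairs, reference)
edited = agreement("Reward helpful, concise answers.", pairs, reference)
drift  = base - edited    # positive drift = accuracy lost on target domain
print(base, edited, drift)  # → 1.0 0.0 1.0
```

The toy exaggerates the effect for clarity; the paper's point is that real drift is far subtler, shifting only a slice of target-domain judgments while aggregate benchmark metrics stay unchanged, which is what makes it hard to catch by spot-checking.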