[2603.13285] Brittlebench: Quantifying LLM robustness via prompt sensitivity
Computer Science > Machine Learning
arXiv:2603.13285 (cs)
[Submitted on 27 Feb 2026 (v1), last revised 6 Apr 2026 (this version, v2)]

Title: Brittlebench: Quantifying LLM robustness via prompt sensitivity
Authors: Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar, Samuel J. Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, Adina Williams

Abstract: Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which face human-generated text queries containing mistakes, typos, or alternative phrasings of the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, which enables us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks and observe model performance degrading by as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation al...
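The abstract does not specify the paper's actual perturbation suite or brittleness metric, so the following is only an illustrative sketch of the general idea: apply a simple semantics-preserving perturbation (here, a hypothetical adjacent-character typo swap) to each prompt, then report the drop in accuracy relative to clean prompts. The function names `swap_typo` and `brittleness` are assumptions for this sketch, not identifiers from the paper.

```python
import random


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent alphabetic characters (a simple typo perturbation).

    This is one hypothetical example of a semantics-preserving perturbation;
    the paper's suite may include paraphrases, casing changes, etc.
    """
    chars = list(text)
    # Candidate positions: pairs of adjacent letters inside a word.
    idx = [i for i in range(len(chars) - 1)
           if chars[i].isalpha() and chars[i + 1].isalpha()]
    if not idx:
        return text
    i = rng.choice(idx)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def brittleness(model, prompts, answers, n_variants=5, seed=0):
    """Clean accuracy minus mean accuracy over perturbed prompt variants.

    `model` is any callable mapping a prompt string to an answer string.
    A larger value means the model is more sensitive to prompt noise.
    """
    rng = random.Random(seed)
    n = len(prompts)
    clean = sum(model(p) == a for p, a in zip(prompts, answers)) / n
    variant_accs = []
    for _ in range(n_variants):
        acc = sum(model(swap_typo(p, rng)) == a
                  for p, a in zip(prompts, answers)) / n
        variant_accs.append(acc)
    return clean - sum(variant_accs) / len(variant_accs)
```

As a usage example, a toy "model" that only recognizes one exact prompt scores perfectly on the clean input but fails on every typo variant, giving a brittleness of 1.0; a model robust to typos would score near 0.

```python
toy = lambda p: "4" if p == "what is 2+2" else "?"
print(brittleness(toy, ["what is 2+2"], ["4"], n_variants=3, seed=1))  # → 1.0
```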