[2603.03336] Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification
Computer Science > Computation and Language

arXiv:2603.03336 (cs)

[Submitted on 11 Feb 2026]

Title: Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

Authors: Angel Rodrigo Avelar Menendez, Yufeng Liu, Xiaowu Dai

Abstract: Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous c...
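As a concrete illustration of the model class the abstract names, below is a minimal Python sketch of a contextual Bradley-Terry-Luce (BTL) model in which each model's latent utility depends on prompt features. The linear form u_i(x) = theta_i . phi(x), the feature dimension, and all variable names are illustrative assumptions, not the paper's specification; the paper's estimator and simultaneous confidence-set construction are not reproduced here.

```python
# Minimal sketch of a contextual BTL preference model.
# ASSUMPTION: utilities are linear in prompt features, u_i(x) = theta_i . x;
# the paper does not necessarily use this parameterization.
import numpy as np

rng = np.random.default_rng(0)

n_models, d = 4, 3                        # number of LLMs, prompt-feature dimension
theta = rng.normal(size=(n_models, d))    # hypothetical per-model utility weights

def utility(theta, x):
    """Latent utility of each model for prompt features x."""
    return theta @ x

def win_prob(theta, x, i, j):
    """BTL probability that model i is preferred over model j on prompt x:
    sigma(u_i(x) - u_j(x))."""
    u = utility(theta, x)
    return 1.0 / (1.0 + np.exp(-(u[i] - u[j])))

# Prompt-dependent ranking: sort models by utility for a given prompt.
# Different prompts can induce different rankings.
x1 = rng.normal(size=d)
x2 = rng.normal(size=d)
print("ranking for prompt 1:", np.argsort(-utility(theta, x1)))
print("ranking for prompt 2:", np.argsort(-utility(theta, x2)))
print("P(model 0 beats model 1 | prompt 1) =", win_prob(theta, x1, 0, 1))
```

Note that this sketch only evaluates point rankings; the abstract's central contribution is to conduct inference on the induced ranking itself, producing confidence sets with simultaneous coverage rather than trusting a single estimated ordering.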