[2602.12424] RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
Summary
The paper introduces RankLLM, a framework for evaluating large language models (LLMs) by quantifying question difficulty, enhancing model comparison and competency assessment.
Why It Matters
RankLLM addresses a gap in existing LLM benchmarks by incorporating question difficulty as a key metric, enabling more nuanced evaluation of model performance. Accounting for difficulty makes model comparisons more informative and supports better-grounded model selection for real-world deployment.
Key Takeaways
- RankLLM quantifies question difficulty to improve LLM evaluation.
- The framework shows 90% agreement with human judgments.
- RankLLM outperforms existing baselines such as Item Response Theory (IRT).
- It offers fast convergence and high computational efficiency.
- The approach facilitates better model comparisons across diverse domains.
Abstract
arXiv:2602.12424 (cs.CL) · Submitted on 12 Feb 2026
Title: RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
Authors: Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun
Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM ...