[2602.18911] From Human-Level AI Tales to AI Leveling Human Scales
Summary
This paper proposes a framework that calibrates benchmark items against the world population and reports AI performance on a common, human-anchored scale, addressing misleading comparisons of AI to "human level".
Why It Matters
The study highlights the limitations of current AI benchmarking practices, which often rely on narrow human baselines. By introducing a more comprehensive calibration method, it aims to enhance the accuracy of AI assessments and improve the understanding of AI capabilities in relation to human performance.
Key Takeaways
- Current AI benchmarks can misrepresent human-level performance when scores are incommensurate or human baselines come from a narrow population.
- The proposed framework calibrates AI performance against a global human population scale.
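- Each scale is calibrated with publicly released human test data from education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench).
- The framework builds on multi-level scales for different capabilities (reasoning, comprehension, knowledge, volume, etc.), where each level represents a probability of success for the whole world population on a logarithmic scale with base B.
- The base B is estimated by using LLMs to extrapolate between samples with two demographic profiles.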
- The methodology allows for better standardization of AI assessments across diverse populations.
arXiv:2602.18911 (cs), Computer Science > Machine Learning
Submitted on 21 Feb 2026
Title: From Human-Level AI Tales to AI Leveling Human Scales
Authors: Peter Romero, Fernando Martínez-Plumed, Zachary R. Tyler, Matthieu Téhénan, Sipeng Chen, Álvaro David Gómez Antón, Luning Sun, Manuel Cebrian, Lexin Zhou, Yael Moros Daval, Daniel Romero-Alvarado, Félix Martí Pérez, Kevin Wei, José Hernández-Orallo
Abstract: Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the 'world population' and report performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities where each level should represent a probability of success of the whole world population on a logarithmic scale with a base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, with the hypothesis that they condense rich information about human populations....
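The abstract describes each level as a probability of success for the whole world population on a logarithmic scale with base B, but it does not give the formula. The following is a minimal Python sketch of one possible reading, assuming that level l corresponds to a success probability of B^(-l), so that higher levels are passed by a smaller share of people. The function names and the value B = 10 are hypothetical and chosen only for illustration; in the paper, B is estimated from data.

```python
import math


def level_to_probability(level: float, base: float) -> float:
    """Assumed mapping (not stated in the abstract): p = base ** (-level).

    Returns the share of the world population expected to succeed
    on an item placed at `level`.
    """
    if base <= 1:
        raise ValueError("base must be greater than 1")
    if level < 0:
        raise ValueError("level must be non-negative")
    return base ** (-level)


def probability_to_level(p: float, base: float) -> float:
    """Inverse of the assumed mapping: level = -log_base(p)."""
    if base <= 1:
        raise ValueError("base must be greater than 1")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)


# Hypothetical base for illustration only; the paper estimates B empirically.
B = 10.0
for level in range(4):
    p = level_to_probability(level, B)
    print(f"level {level}: assumed population success rate = {p:.4g}")
```

Under this assumed reading with B = 10, a level-3 item would be solved by roughly one person in a thousand. A different base would stretch or compress the scale without changing the ordering of items.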