[2602.15532] Quantifying construct validity in large language model evaluations
Summary
This paper introduces the structured capabilities model, which aims to improve the construct validity of large language model (LLM) evaluations by separating benchmark results from the capabilities they are meant to measure.
Why It Matters
Benchmark scores are routinely reported as if they were direct measurements of model capabilities, yet problems such as test set contamination and annotator error can distort them. Establishing construct validity, i.e., whether a benchmark actually measures the capability it claims to, is essential for trusting evaluations. This work shows why existing formal models fall short and proposes one with better predictive power, which matters for advancing AI research and applications.
Key Takeaways
- Current benchmark methods often conflate model performance with capabilities.
- Existing formal models have significant limitations: latent factor models extract capabilities that largely proxy model size, and scaling laws overfit to the observed benchmarks (see the sketch after this list).
- The structured capabilities model offers a more interpretable and generalizable approach.
- This new model improves predictive accuracy for out-of-distribution benchmarks.
- Separating model scale from capabilities is key to enhancing construct validity.
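To make the first limitation concrete, below is a minimal synthetic sketch, not taken from the paper, of why the "capability" extracted by a latent factor model can end up proxying model size. It uses PCA as a simple stand-in for a one-factor model and a noisy sigmoid of log parameter count as a stand-in for per-benchmark scaling behaviour; the data, sizes, and constants are all illustrative assumptions.

```python
# Hypothetical illustration (not from the paper): when benchmark scores are
# driven mostly by model scale, the first factor of a latent factor model
# ends up nearly collinear with log(parameter count).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

n_models, n_benchmarks = 40, 8
log_n = rng.uniform(7, 11, size=n_models)            # log10 parameters, 10M..100B

# Each benchmark is a noisy sigmoid of scale (a crude stand-in for a scaling law).
midpoints = rng.uniform(8, 10, size=n_benchmarks)
slopes = rng.uniform(0.8, 2.0, size=n_benchmarks)
scores = 1 / (1 + np.exp(-slopes * (log_n[:, None] - midpoints)))
scores += rng.normal(scale=0.05, size=scores.shape)  # measurement error

# A one-factor model, approximated here by PCA on standardized scores.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
factor = PCA(n_components=1).fit_transform(z).ravel()

# The extracted "capability" is almost just model size.
print(f"corr(factor, log N) = {abs(np.corrcoef(factor, log_n)[0, 1]):.3f}")
```

On this synthetic data the printed correlation comes out close to 1: the factor model has not isolated a capability at all, only scale.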
Computer Science > Artificial Intelligence
arXiv:2602.15532 (cs)
[Submitted on 17 Feb 2026]
Title: Quantifying construct validity in large language model evaluations
Authors: Ryan Othniel Kearns
Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large colle...
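The abstract is truncated and this summary does not spell out the structured capabilities model itself, so the following is only a rough sketch of the idea the paper centres on, separating model scale from capabilities, and not the paper's actual formulation. It explains away a per-benchmark trend in log parameter count first and then factors the residuals, so that whatever shared structure remains can no longer simply proxy scale; the quadratic trend and the residual_factor helper are illustrative assumptions.

```python
# Hypothetical sketch of scale/capability separation (not the paper's model):
# fit a per-benchmark trend in log N, then look for shared structure in the
# residuals, so the extracted factor no longer proxies model size.
import numpy as np
from sklearn.decomposition import PCA

def residual_factor(scores: np.ndarray, log_n: np.ndarray) -> np.ndarray:
    """First principal component of benchmark scores after removing a
    per-benchmark quadratic trend in log model size."""
    X = np.vander(log_n, 3)                    # columns: log_n^2, log_n, 1
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    resid = scores - X @ coef                  # scale explained away per benchmark
    z = resid / (resid.std(axis=0) + 1e-9)     # residuals are already mean-zero
    return PCA(n_components=1).fit_transform(z).ravel()
```

Run on the synthetic data from the previous sketch, residual_factor(scores, log_n) is essentially uncorrelated with log_n (least squares residuals are orthogonal to the scale regressors by construction), whereas the raw first principal component was almost collinear with it.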