NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates
Published February 2, 2024
Lizhou Fan (lizhouf), Wenyue Hua (wenyueH), Haoyang Ling (hyfrankl), Clémentine Fourrier (clefourrier)

We're happy to introduce the NPHardEval leaderboard, built on NPHardEval, a benchmark developed by researchers from the University of Michigan and Rutgers University. NPHardEval introduces a dynamic, complexity-based framework for assessing the reasoning abilities of Large Language Models (LLMs). It poses 900 algorithmic questions spanning complexity classes up to and including NP-Hard, designed to rigorously test LLMs, and it is updated on a monthly basis to prevent overfitting.

A Unique Approach to LLM Evaluation

NPHardEval stands apart by employing computational complexity classes, offering a quantifiable and robust measure of LLM reasoning skills. The benchmark's tasks mirror real-world decision-making challenges, enhancing its relevance and applicability. Regular monthly updates of the benchmark data points mitigate the risk of model overfitting, ensuring a reliable evaluation.

The major contributions of NPHardEval are a new benchmarking strategy (an automatic, dynamic benchmark) and a new way to evaluate LLM reasoning. Regarding the benchmarking strategy, NPHardEval uses an automated mechanism both to generate and to check the questions in the benchmark.
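To make the "generate and check" idea concrete, here is a minimal sketch of what such a loop could look like for one illustrative task (0/1 Knapsack). The function and class names are hypothetical and are not taken from the NPHardEval codebase; they only illustrate the pattern of synthesizing fresh instances algorithmically and verifying a model's answer with an exact solver, which is what allows the benchmark data to be refreshed every month.

```python
# Illustrative sketch only: hypothetical names, not the actual NPHardEval code.
# Pattern shown: synthesize a question -> (query the LLM) -> verify the answer
# against an exact, algorithmic ground truth.
import random
from dataclasses import dataclass


@dataclass
class KnapsackQuestion:
    weights: list
    values: list
    capacity: int


def generate_question(n_items: int = 5, seed: int = 0) -> KnapsackQuestion:
    """Synthesize a fresh Knapsack instance; a new seed yields new data points."""
    rng = random.Random(seed)
    weights = [rng.randint(1, 20) for _ in range(n_items)]
    values = [rng.randint(1, 20) for _ in range(n_items)]
    capacity = sum(weights) // 2
    return KnapsackQuestion(weights, values, capacity)


def optimal_value(q: KnapsackQuestion) -> int:
    """Exact dynamic-programming solution, used as the ground-truth checker."""
    dp = [0] * (q.capacity + 1)
    for w, v in zip(q.weights, q.values):
        for c in range(q.capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[q.capacity]


def check_answer(q: KnapsackQuestion, chosen_items: list) -> bool:
    """Verify a proposed selection of item indices: feasible and optimal."""
    weight = sum(q.weights[i] for i in chosen_items)
    value = sum(q.values[i] for i in chosen_items)
    return weight <= q.capacity and value == optimal_value(q)


if __name__ == "__main__":
    q = generate_question(seed=42)
    # In the real benchmark, the proposed answer would come from the LLM under test.
    print(q, check_answer(q, [0, 1]))
```

Because the checker is an exact algorithm rather than a human-written answer key, regenerating the question set each month requires no manual re-annotation.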