[2509.24210] BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Computer Science > Computation and Language
arXiv:2509.24210 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Authors: Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang

Abstract: Evaluating language models fairly is increasingly difficult as static benchmarks risk contamination by training data, obscuring whether models truly reason or recall. We introduce BeyondBench, an evaluation framework using algorithmic problem generation to create mathematically grounded problems on the fly, ensuring each test remains uncontaminated. Our framework covers 44 algorithmic tasks with 117 variations across three difficulty levels: the Easy Suite (29 tasks) for arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) for NP-complete and constraint satisfaction problems. Each task draws from a space exceeding 10^15 unique instances, with deterministically verified solutions. We evaluated 101 language models (85 open-source, 16 closed-source), spanning 0.5B to 141B parameters and multiple quantization schemes, using three-fold evaluation for robustness. Results reveal ...
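The core idea of on-the-fly generation with deterministic verification can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not the actual BeyondBench implementation: the task (mean of a random integer list, in the spirit of an Easy Suite problem), the names generate_problem, verify, and fake_model, and the tolerance value are all assumptions. The sketch only shows the shape of the approach: sample an instance from a large space, pose it as text, and score the answer programmatically.

```python
import random

def generate_problem(rng: random.Random) -> tuple[str, float]:
    """Hypothetical Easy-Suite-style task: mean of a random integer list.

    The instance space (list length x value range) is large, so freshly
    sampled problems are effectively never repeated across evaluations.
    """
    values = [rng.randint(-1000, 1000) for _ in range(rng.randint(5, 15))]
    question = f"Compute the arithmetic mean of {values}."
    answer = sum(values) / len(values)  # ground truth, computed directly
    return question, answer

def verify(model_output: str, expected: float, tol: float = 1e-6) -> bool:
    """Deterministically score a model's numeric answer."""
    try:
        return abs(float(model_output.strip()) - expected) <= tol
    except ValueError:
        return False  # unparseable output counts as incorrect

def fake_model(question: str) -> str:
    """Stand-in for a real language-model call."""
    return "0.0"

# Three-fold evaluation on independently generated instances.
rng = random.Random(2024)  # seed fixed only to make this run reproducible
for fold in range(3):
    question, answer = generate_problem(rng)
    print(f"fold {fold}: correct={verify(fake_model(question), answer)}")
```

Because the checker recomputes the ground truth for each freshly sampled instance rather than comparing against a stored answer key, there is no static dataset that could leak into a model's training corpus.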