[2510.02386] On The Fragility of Benchmark Contamination Detection in Reasoning Models
Computer Science > Cryptography and Security

arXiv:2510.02386 (cs)

[Submitted on 30 Sep 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: On The Fragility of Benchmark Contamination Detection in Reasoning Models

Authors: Han Wang, Haoyu Li, Brian Ko, Huan Zhang

Abstract: Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to higher rankings is to incorporate evaluation benchmarks into the training data, yielding inflated performance known as benchmark contamination. Surprisingly, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on two scenarios where contamination may occur in practice: (I) when the base model evolves into an LRM via SFT and RL, we find that contamination introduced during SFT can initially be identified by contamination detection methods. Yet even a brief GRPO training run can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that PPO-style importance sampling and clipping objectives are the root cause of this concealment, suggesting that a broad class of RL methods may inherently exhibit similar concealment capability; (II) when SFT contaminati...
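The abstract refers to "contamination detection methods" without naming them here. As a hedged illustration of the likelihood-based family such detectors typically belong to, the sketch below implements a Min-K% Prob-style membership score (mean log-probability of the least likely tokens); the function name and the choice of k are hypothetical, and this is not claimed to be the specific detector evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_prob(model, tokenizer, text, k=0.2):
    """Min-K% Prob-style membership score: mean log-prob of the
    k% least likely tokens under the model.

    Higher scores suggest the text may have appeared in training data.
    Illustrative sketch of likelihood-based contamination detection,
    not the paper's exact method.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab)
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Average over the k% lowest-probability tokens.
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

# Usage (hypothetical): score benchmark items and flag outliers.
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tok = AutoTokenizer.from_pretrained("gpt2")
# score = min_k_percent_prob(model, tok, "A benchmark question ...")
```

Detectors of this kind compare scores on benchmark items against a calibration set; the abstract's claim is that RL fine-tuning shifts these token-level likelihoods enough to wash out the signal.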
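For reference, the PPO-style clipped surrogate objective that the abstract identifies as the root cause of the concealment takes the following standard form (standard notation from the PPO literature, not reproduced from this paper); GRPO optimizes the same clipped importance-sampling ratio $r_t(\theta)$:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
$$

Because updates act through the ratio $r_t(\theta)$ and are clipped to a band around 1, RL training reshapes token likelihoods relative to the previous policy, which is consistent with the abstract's observation that such objectives can erase the likelihood-based traces that detection methods depend on.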