[2510.02386] On The Fragility of Benchmark Contamination Detection in Reasoning Models
Computer Science > Cryptography and Security

arXiv:2510.02386 (cs)

[Submitted on 30 Sep 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: On The Fragility of Benchmark Contamination Detection in Reasoning Models

Authors: Han Wang, Haoyu Li, Brian Ko, Huan Zhang

Abstract: Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to higher rankings is to incorporate evaluation benchmarks into the training data, yielding inflated performance known as benchmark contamination. Surprisingly, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on two scenarios where contamination may occur in practice: (I) when the base model evolves into an LRM via SFT and RL, we find that contamination introduced during SFT can initially be identified by contamination detection methods. Yet even a brief GRPO training run can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that PPO-style importance sampling and clipping objectives are the root cause of this concealment, suggesting that a broad class of RL methods may inherently exhibit similar concealment capability; (II) when SFT contaminati...
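The abstract refers to "contamination detection methods" without naming them here. As a hedged illustration of the likelihood-based family such detectors typically belong to, the sketch below implements a Min-K% Prob-style membership score (mean log-probability of the least likely tokens); the function name and the choice of k are hypothetical, and this is not claimed to be the specific detector evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_prob(model, tokenizer, text, k=0.2):
    """Min-K% Prob-style membership score: mean log-prob of the
    k% least likely tokens under the model.

    Higher scores suggest the text may have appeared in training data.
    Illustrative sketch of likelihood-based contamination detection,
    not the paper's exact method.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab)
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Average over the k% lowest-probability tokens.
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

# Usage (hypothetical): score benchmark items and flag outliers.
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tok = AutoTokenizer.from_pretrained("gpt2")
# score = min_k_percent_prob(model, tok, "A benchmark question ...")
```

Detectors of this kind compare scores on benchmark items against a calibration set; the abstract's claim is that RL fine-tuning shifts these token-level likelihoods enough to wash out the signal.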
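For reference, the PPO-style clipped surrogate objective that the abstract identifies as the root cause of the concealment takes the following standard form (standard notation from the PPO literature, not reproduced from this paper); GRPO optimizes the same clipped importance-sampling ratio $r_t(\theta)$:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
$$

Because updates act through the ratio $r_t(\theta)$ and are clipped to a band around 1, RL training reshapes token likelihoods relative to the previous policy, which is consistent with the abstract's observation that such objectives can erase the likelihood-based traces that detection methods depend on.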