[2602.13626] Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?
Summary
This paper examines benchmark data leakage in LLM-based recommendation systems, revealing how it can distort performance metrics and mislead evaluations.
Why It Matters
As LLMs become integral to recommendation systems, understanding benchmark data leakage is crucial for ensuring reliable evaluations. This study highlights a significant issue that can lead to inflated performance metrics, affecting trust in AI recommendations and their applications across industries.
Key Takeaways
- Benchmark data leakage can inflate performance metrics in LLMs.
- Domain-relevant leakage leads to misleading performance gains.
- Domain-irrelevant leakage typically degrades recommendation accuracy.
- The study emphasizes the need for rigorous evaluation methods.
- Understanding leakage is essential for improving AI trustworthiness.
Computer Science > Machine Learning
arXiv:2602.13626 (cs) [Submitted on 14 Feb 2026]
Title: Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?
Authors: Mingqiao Zhang, Qiyao Peng, Yumeng Wang, Chunyuan Liu, Hongtao Liu
Abstract: The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings...
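The leakage simulation described in the abstract (continued pre-training on corpora that blend ordinary text with benchmark user-item interactions) can be sketched as a corpus-mixing step. This is a minimal illustration, not the authors' actual pipeline; the `blend_corpus` function, the serialization of interactions as text lines, and the `leak_ratio` parameter are all assumptions for the sake of the example.

```python
import random

def blend_corpus(base_docs, leaked_docs, leak_ratio, seed=0):
    """Mix leaked benchmark records into a pre-training corpus.

    leak_ratio is the fraction of the final corpus drawn from
    leaked_docs; the remainder comes from base_docs. Sampling is
    with replacement so any ratio is reachable regardless of how
    small the leaked pool is.
    """
    rng = random.Random(seed)
    total = len(base_docs)
    n_leak = int(total * leak_ratio)
    n_base = total - n_leak
    mixed = (rng.choices(base_docs, k=n_base)
             + rng.choices(leaked_docs, k=n_leak))
    rng.shuffle(mixed)  # interleave so leaked records are spread out
    return mixed

# Hypothetical example: benchmark interactions serialized as text.
base = [f"generic web text {i}" for i in range(1000)]
leaked = [f"user_{u} rated item_{i} 5 stars"
          for u, i in [(1, 10), (2, 20), (3, 30)]]
corpus = blend_corpus(base, leaked, leak_ratio=0.05)
```

Varying `leak_ratio` (and whether `leaked` comes from the evaluation benchmark's own domain or an unrelated one) is what would let an experiment separate the domain-relevant gains from the domain-irrelevant degradation the paper reports.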