[2602.19641] Evaluating the Impact of Data Anonymization on Image Retrieval
Summary
This article evaluates how data anonymization affects the performance of Content-Based Image Retrieval (CBIR) systems, highlighting the balance between privacy and retrieval accuracy.
Why It Matters
As privacy regulations like GDPR become more stringent, understanding the implications of data anonymization on machine learning systems is crucial. This study provides insights into maintaining performance in CBIR while adhering to privacy standards, which is increasingly relevant for organizations handling sensitive visual data.
Key Takeaways
- Anonymization can negatively impact CBIR system performance.
- The study proposes a simple evaluation framework: retrieval results after anonymization should match those obtained before anonymization as closely as possible.
- Results indicate a bias favoring models trained on original data.
- The findings are relevant for developing privacy-compliant CBIR systems.
- Three anonymization methods, four anonymization degrees, and four training strategies were assessed, all using a DINOv2 backbone.
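The evaluation idea in the takeaways (retrieval results after anonymization should match those before it) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the overlap-at-k score, the function names `top_k_indices` and `mean_overlap_at_k`, the random vectors standing in for image embeddings, and the added noise standing in for anonymization. The paper's actual metric and pipeline may differ.

```python
import numpy as np


def top_k_indices(queries, gallery, k):
    """Return the indices of the k most cosine-similar gallery items per query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T
    # Sort by descending similarity and keep the first k columns.
    return np.argsort(-sims, axis=1)[:, :k]


def mean_overlap_at_k(before, after):
    """Average fraction of top-k results shared between two retrieval runs."""
    overlaps = [len(set(b) & set(a)) / len(b) for b, a in zip(before, after)]
    return float(np.mean(overlaps))


rng = np.random.default_rng(0)

# Hypothetical embeddings; a real setup would use features from an image backbone.
gallery = rng.normal(size=(100, 32))
queries = rng.normal(size=(10, 32))

# Hypothetical stand-in for anonymization: perturb the embeddings with noise.
gallery_anon = gallery + 0.5 * rng.normal(size=gallery.shape)
queries_anon = queries + 0.5 * rng.normal(size=queries.shape)

before = top_k_indices(queries, gallery, k=5)
after = top_k_indices(queries_anon, gallery_anon, k=5)

identical = mean_overlap_at_k(before, before)  # same run compared with itself
score = mean_overlap_at_k(before, after)       # original vs. perturbed run
print(f"overlap@5 (identical runs): {identical:.2f}")
print(f"overlap@5 (after perturbation): {score:.2f}")
```

A score near 1 would mean the perturbed retrieval returns almost the same items as the original, while a score near 0 would mean the result lists have diverged. Set overlap ignores rank order within the top k, so a rank-aware measure may be preferable when ordering matters.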
Computer Science > Machine Learning
arXiv:2602.19641 (cs)
[Submitted on 23 Feb 2026]
Title: Evaluating the Impact of Data Anonymization on Image Retrieval
Authors: Marvin Chen, Manuel Eberhardinger, Johannes Maucher
Abstract: With the growing importance of privacy regulations such as the General Data Protection Regulation, anonymizing visual data is becoming increasingly relevant across institutions. However, anonymization can negatively affect the performance of Computer Vision systems that rely on visual features, such as Content-Based Image Retrieval (CBIR). Despite this, the impact of anonymization on CBIR has not been systematically studied. This work addresses this gap, motivated by the DOKIQ project, an artificial intelligence-based system for document verification actively used by the State Criminal Police Office Baden-Württemberg. We propose a simple evaluation framework: retrieval results after anonymization should match those obtained before anonymization as closely as possible. To this end, we systematically assess the impact of anonymization using two public datasets and the internal DOKIQ dataset. Our experiments span three anonymization methods, four anonymization degrees, and four training strategies, all based on the state-of-the-art backbone DINOv2 (Self-Distillation with No Labels). Our results reveal...