A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Summary
This paper evaluates the effectiveness of textual data sanitization methods, showing that current techniques can provide a false sense of privacy because they fail to protect against nuanced re-identification risks.
Why It Matters
As data privacy concerns grow, understanding the limitations of sanitization methods is crucial for developers and organizations handling sensitive information. This research highlights the inadequacies of existing tools and emphasizes the need for improved privacy protection strategies.
Key Takeaways
- Current sanitization methods often fail to protect against nuanced re-identification risks.
- Auxiliary information can reveal sensitive attributes even after PII removal.
- Differential privacy can mitigate some risks but may reduce data utility.
- The study demonstrates that existing tools, like Azure's PII removal, are not fully effective (see the sketch after this list).
- A new framework for evaluating privacy risks is proposed to enhance data protection.
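To make the Azure takeaway concrete, here is a minimal sketch of redacting a document with Azure's PII recognition via the azure-ai-textanalytics Python SDK. The endpoint, key, and clinical note are placeholders of my own, not examples from the paper; the point is that explicit identifiers get masked while quasi-identifying details survive.

```python
# Minimal sketch: masking explicit PII with Azure's Text Analytics
# PII recognition (azure-ai-textanalytics SDK). The endpoint, key,
# and sample note below are hypothetical placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

note = (
    "John Smith, 58, attends the Tuesday night pottery class at the "
    "community center and has a history of opioid use."
)

result = client.recognize_pii_entities([note])[0]
if not result.is_error:
    # Explicit identifiers (the name, the age) are masked out...
    print(result.redacted_text)
    # ...but quasi-identifiers like "Tuesday night pottery class"
    # survive redaction and can support the linkage attacks the
    # paper studies.
    for entity in result.entities:
        print(entity.category, "->", entity.text)
```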
Computer Science > Cryptography and Security
arXiv:2504.21035 (cs)
[Submitted on 28 Apr 2025 (v1), last revised 19 Feb 2026 (this version, v3)]
Title: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Authors: Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh
Abstract: Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data, under the assumption that these methods adequately protect privacy. However, their effectiveness is often assessed only by measuring the leakage of explicit identifiers, ignoring nuanced textual markers that can lead to re-identification. We challenge this illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information -- such as routine social activities -- can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74% of information in the MedQA dataset. Although differential privacy mitigates these risks to...
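As a rough illustration of the kind of re-identification attack the framework evaluates, here is a toy linkage attack of my own sketching (not the paper's method; all records and auxiliary facts are invented). An attacker who knows only someone's routine activities can match them back to a sanitized record:

```python
# Toy linkage attack: auxiliary knowledge of routine activities
# re-identifies records after explicit PII removal. All data here
# is invented for illustration.

sanitized_notes = {
    "rec_01": "patient attends [REDACTED] yoga class on tuesdays, history of opioid use",
    "rec_02": "patient coaches little league on weekends, managing type 2 diabetes",
}

# What an attacker might know about each target, e.g. from public posts.
auxiliary = {
    "alice": {"yoga", "tuesdays"},
    "bob": {"little", "league", "weekends"},
}

def link(aux_terms, notes):
    """Score each sanitized record by token overlap with the attacker's
    auxiliary terms and return the best-matching record id."""
    scores = {}
    for rec_id, text in notes.items():
        tokens = set(text.replace(",", "").split())
        scores[rec_id] = len(aux_terms & tokens)
    return max(scores, key=scores.get)

for person, terms in auxiliary.items():
    match = link(terms, sanitized_notes)
    print(f"{person} -> {match}: {sanitized_notes[match]}")
```

Even with the name redacted, the overlap in everyday details is enough to link each person to their record, and thus to the sensitive attribute (e.g., opioid use history) the redaction was supposed to protect.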