[2511.04934] Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding
Summary
The paper exposes the limitations of current unlearning methods in large language models (LLMs), showing that they fail to effectively erase sensitive information once outputs are sampled with probabilistic decoding. It introduces a new metric, leak@$k$, to evaluate unlearning reliability, and proposes an algorithm, RULE, to reduce knowledge leakage.
Why It Matters
As LLMs become integral to applications involving sensitive data, ensuring they can forget information is crucial for compliance and ethical standards. This research highlights significant gaps in existing unlearning techniques, emphasizing the need for more robust solutions to protect user privacy and data integrity.
Key Takeaways
- Current unlearning methods in LLMs are largely ineffective under probabilistic decoding.
- The new leak@$k$ metric provides a systematic way to evaluate unlearning reliability.
- The proposed RULE algorithm demonstrates improved performance in preventing knowledge leakage.
Computer Science > Machine Learning
arXiv:2511.04934 (cs)
[Submitted on 7 Nov 2025 (v1), last revised 21 Feb 2026 (this version, v2)]
Title: Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding
Authors: Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, Mingyi Hong
Abstract: Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that *almost all* existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these 'unlearned' models under deterministic (greedy) decoding often suggest successful knowledge removal on standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce `leak@$k$`, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, ...
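The abstract defines leak@$k$ as the likelihood that forgotten knowledge reappears in at least one of $k$ sampled generations. The paper's exact estimator is not reproduced in this excerpt, but a plausible sketch follows the standard unbiased pass@$k$ construction: for a prompt with $n$ sampled generations of which $c$ leak the target content, the chance that a random size-$k$ subset contains at least one leaking sample is $1 - \binom{n-c}{k}/\binom{n}{k}$. The function name and the illustrative numbers below are assumptions, not values from the paper.

```python
from math import comb

def leak_at_k(n: int, c: int, k: int) -> float:
    """Hypothetical leak@k estimator for one prompt, mirroring the
    unbiased pass@k estimator: probability that at least one of k
    samples (out of n generations, c of which leaked the supposedly
    forgotten content) contains a leak.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a leaking sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 100 sampled generations, 5 of which leak.
# A single sample (k=1) rarely leaks, but drawing 10 samples makes a
# leak far more likely -- the gap the paper attributes to probabilistic
# decoding versus a single greedy generation.
print(round(leak_at_k(100, 5, 1), 3))   # 0.05
print(round(leak_at_k(100, 5, 10), 3))  # 0.416
```

This mirrors why greedy-decoding evaluations can look safe while repeated sampling still surfaces the forgotten content: leak@$k$ grows quickly in $k$ even when the per-sample leak rate is small.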