[2602.16980] Discovering Universal Activation Directions for PII Leakage in Language Models
Summary
The paper introduces UniLeak, a framework that identifies universal activation directions in language models, shedding light on how PII leakage is represented and modulated within model representations.
Why It Matters
As language models become more prevalent, understanding how they handle personally identifiable information (PII) is crucial for privacy and security. This research provides insights into the mechanisms behind PII leakage, offering potential pathways for risk mitigation and improved model safety.
Key Takeaways
- UniLeak identifies universal activation directions that increase PII leakage.
- The framework operates without needing access to training data or ground-truth PII.
- Steering along these directions amplifies PII generation probability with minimal impact on output quality.
- The findings highlight a new perspective on PII leakage as a latent signal in model representations.
- The research suggests methods for both amplifying and mitigating PII risks in language models.
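The steering mechanism described above, linearly adding a fixed direction to a model's residual stream at inference time, can be sketched with a PyTorch forward hook. The model, layer choice, and steering strength below are illustrative stand-ins, not the paper's actual setup, and the direction is random here rather than one discovered by UniLeak.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Stand-in for a transformer's residual stream: a small stack of layers.
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(4)])

# A unit-norm "leakage direction" (discovered by UniLeak in the paper;
# random here purely for illustration).
direction = torch.randn(d_model)
direction = direction / direction.norm()
alpha = 4.0  # steering strength (hypothetical value)

def steer(module, inputs, output):
    # Linearly add the direction to this layer's output activations;
    # returning a value from a forward hook replaces the layer's output.
    return output + alpha * direction

handle = model[1].register_forward_hook(steer)
x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)

# The intervention propagates through downstream layers.
print(torch.allclose(steered, unsteered))  # False when alpha != 0
```

In a real language model the same pattern applies: the hook is attached to a chosen transformer block, and the added vector shifts the next-token distribution while leaving generation quality largely intact.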
Computer Science > Machine Learning
arXiv:2602.16980 (cs)
[Submitted on 19 Feb 2026]
Title: Discovering Universal Activation Directions for PII Leakage in Language Models
Authors: Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong
Abstract: Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling b...