[2602.20580] Personal Information Parroting in Language Models

arXiv - Machine Learning · 3 min read

Summary

This article examines personal information memorization in language models, highlighting the privacy risks and introducing a detector suite that measures how often models reproduce such data verbatim.

Why It Matters

As language models increasingly incorporate vast amounts of data, understanding their propensity to memorize personal information is crucial for privacy and ethical AI development. This research provides insights into how model size and training duration affect memorization, emphasizing the need for better data handling practices.

Key Takeaways

  • Language models can memorize personal information, posing privacy risks.
  • The study introduces a regex-based detector suite that outperforms existing solutions.
  • Memorization rates increase with model size and training duration.
  • Even smaller models can parrot personal information verbatim.
  • Recommendations include filtering and anonymizing pretraining datasets.
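The regex-based detection idea behind the study's R&R suite can be illustrated with a minimal sketch. The patterns below are simplified, illustrative regexes for the three PI types the paper targets (email addresses, phone numbers, IP addresses); they are not the paper's actual rules, which are more robust.

```python
import re

# Simplified, illustrative patterns for the three PI types the paper targets.
# The actual R&R detector suite uses more elaborate regexes and rules.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
}

def detect_pi(text):
    """Return (kind, match) pairs for every PI-like span found in text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.group()))
    return hits

sample = "Contact alice@example.com or call 555-123-4567 from 192.168.0.1."
print(detect_pi(sample))
```

A detector like this runs over the pretraining corpus to locate PI spans, which can then be tested for memorization or filtered out before training.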

Computer Science > Computation and Language
arXiv:2602.20580 (cs) · Submitted on 24 Feb 2026

Title: Personal Information Parroting in Language Models
Authors: Nishant Subramani, Kshitish Ghate, Mona Diab

Abstract: Modern language models (LMs) are trained on large scrapes of the Web containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes-and-rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization, finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
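The verbatim-parroting test the abstract describes, prompting a model with the tokens that precede a PI span and checking whether greedy decoding reproduces the span exactly, can be sketched in a few lines. The toy "model" below is a stand-in that has memorized a single document; a real evaluation would call greedy decoding on an actual LM such as Pythia.

```python
def parrots_verbatim(generate_greedy, prefix_tokens, pi_tokens):
    """True if greedy decoding from the prefix emits the PI span exactly.

    generate_greedy(prefix, n) must return the next n tokens the model
    would produce under greedy decoding.
    """
    continuation = generate_greedy(prefix_tokens, len(pi_tokens))
    return continuation == pi_tokens

# Toy stand-in for a model that has memorized one training document verbatim.
# A real LM would run greedy decoding over its vocabulary instead.
MEMORIZED_DOC = "contact me at alice@example.com for details".split()

def toy_greedy(prefix, n):
    if MEMORIZED_DOC[: len(prefix)] == prefix:
        return MEMORIZED_DOC[len(prefix) : len(prefix) + n]
    return ["<unk>"] * n

prefix = MEMORIZED_DOC[:3]            # tokens preceding the PI span
pi = ["alice@example.com"]            # the PI span itself
print(parrots_verbatim(toy_greedy, prefix, pi))  # → True
```

Running this check over every detected PI span and its preceding context yields the parroting rates the paper reports (13.6% for Pythia-6.9b, 2.7% for Pythia-160m).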

