[2602.19020] Learning to Detect Language Model Training Data via Active Reconstruction

arXiv - AI

Summary

This paper introduces the Active Data Reconstruction Attack (ADRA), which detects language model training data by using reinforcement learning to finetune the target model to reconstruct candidate text. Training members prove more reconstructible than non-members, and the method outperforms traditional membership inference attacks.

Why It Matters

As language models become more prevalent, understanding their training data is crucial for ensuring transparency and accountability in AI systems. This research offers a new method to identify training data, which can help mitigate risks associated with data privacy and model misuse.

Key Takeaways

  • ADRA utilizes reinforcement learning to actively reconstruct training data.
  • The method shows a significant improvement in detecting training data compared to existing techniques.
  • ADRA's adaptive variant, ADRA+, further improves detection, particularly of pre-training data.
  • The research highlights the importance of understanding model training data for AI safety.
  • This approach can inform future developments in membership inference attacks.

Computer Science > Machine Learning
arXiv:2602.19020 (cs) [Submitted on 22 Feb 2026]

Title: Learning to Detect Language Model Training Data via Active Reconstruction
Authors: Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi

Abstract: Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods c...
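The core intuition, that training members are more reconstructible than non-members and can therefore be ranked within a candidate pool, can be illustrated with a toy scoring sketch. The token-overlap metric and ranking below are illustrative assumptions only; the paper's actual method finetunes a policy initialized from the target model with on-policy RL and contrastive rewards, and its reconstruction metrics are not specified in this summary.

```python
# Toy sketch of reconstruction-based membership scoring.
# ASSUMPTION: difflib token overlap stands in for the paper's reconstruction
# metrics; the real attack elicits reconstructions via RL finetuning first.
from difflib import SequenceMatcher


def reconstruction_score(reference: str, generated: str) -> float:
    """Token-level similarity (0..1) between the ground-truth text and the
    model's attempted reconstruction; higher means more reconstructible."""
    return SequenceMatcher(None, reference.split(), generated.split()).ratio()


def rank_candidates(candidates):
    """candidates: iterable of (candidate_id, reference_text, reconstruction).
    Returns (id, score) pairs sorted by descending reconstructibility, so
    likely training members come first."""
    scored = [(cid, reconstruction_score(ref, gen))
              for cid, ref, gen in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a detection threshold on the score (or simply taking the top of the ranking) plays the role of the membership decision; the paper's contribution is making the reconstructions themselves more separable by actively training the model, rather than scoring passive generations.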

