[2602.19020] Learning to Detect Language Model Training Data via Active Reconstruction

arXiv - AI

Summary

This paper introduces the Active Data Reconstruction Attack (ADRA), which detects language model training data by using reinforcement learning to finetune the target model to reconstruct candidate text. Training members prove more reconstructible than non-members, and the method outperforms traditional membership inference attacks.

Why It Matters

As language models become more prevalent, understanding their training data is crucial for ensuring transparency and accountability in AI systems. This research offers a new method to identify training data, which can help mitigate risks associated with data privacy and model misuse.

Key Takeaways

  • ADRA utilizes reinforcement learning to actively reconstruct training data.
  • The method shows a significant improvement in detecting training data compared to existing techniques.
  • ADRA's adaptive variant, ADRA+, further improves detection, particularly of pre-training data.
  • The research highlights the importance of understanding model training data for AI safety.
  • This approach can inform future developments in membership inference attacks.

Computer Science > Machine Learning
arXiv:2602.19020 (cs) [Submitted on 22 Feb 2026]

Title: Learning to Detect Language Model Training Data via Active Reconstruction
Authors: Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi

Abstract: Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods c...
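The core intuition, that training members are more reconstructible than non-members and can therefore be ranked within a candidate pool, can be illustrated with a toy scoring sketch. The token-overlap metric and ranking below are illustrative assumptions only; the paper's actual method finetunes a policy initialized from the target model with on-policy RL and contrastive rewards, and its reconstruction metrics are not specified in this summary.

```python
# Toy sketch of reconstruction-based membership scoring.
# ASSUMPTION: difflib token overlap stands in for the paper's reconstruction
# metrics; the real attack elicits reconstructions via RL finetuning first.
from difflib import SequenceMatcher


def reconstruction_score(reference: str, generated: str) -> float:
    """Token-level similarity (0..1) between the ground-truth text and the
    model's attempted reconstruction; higher means more reconstructible."""
    return SequenceMatcher(None, reference.split(), generated.split()).ratio()


def rank_candidates(candidates):
    """candidates: iterable of (candidate_id, reference_text, reconstruction).
    Returns (id, score) pairs sorted by descending reconstructibility, so
    likely training members come first."""
    scored = [(cid, reconstruction_score(ref, gen))
              for cid, ref, gen in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a detection threshold on the score (or simply taking the top of the ranking) plays the role of the membership decision; the paper's contribution is making the reconstructions themselves more separable by actively training the model, rather than scoring passive generations.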

