[2604.09016] Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
Computer Science > Machine Learning
arXiv:2604.09016 (cs)
[Submitted on 10 Apr 2026]

Title: Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
Authors: Carlos Jimeno Miguel, Raul Orduna, Francesco Zola

Abstract: This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, a system is proposed for collecting information from the Telegram platform, including text, audio, and images; the implementation of speech-to-text transcription models incorporating signal enhancement techniques; and the evaluation of different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models designed using a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest f1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data, while simultaneously guara...
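
The abstract mentions two ideas that a small sketch can make concrete: replacing detected entities while keeping the text structurally coherent, and scoring a detector with the F1-score. The Python below is only an illustration under stated assumptions. It is not the authors' system, and it does not use Microsoft Presidio or a transformer model. The regular expressions in `PATTERNS`, the placeholder format `<LABEL_n>`, and the function names `find_entities`, `anonymize`, and `f1_score` are all hypothetical stand-ins chosen for this example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a NER model: two simple regex detectors.
# A real pipeline would plug in a trained recognizer here instead.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_entities(text):
    """Return non-overlapping (start, end, label) spans in text order."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    kept, last_end = [], -1
    for start, end, label in spans:
        if start >= last_end:  # drop spans that overlap an earlier one
            kept.append((start, end, label))
            last_end = end
    return kept


def anonymize(text, spans):
    """Replace each span with a placeholder such as <EMAIL_1>.

    The same surface string always maps to the same placeholder, so
    repeated mentions stay linked after anonymization.
    """
    mapping = {}
    counters = defaultdict(int)
    pieces, cursor = [], 0
    for start, end, label in spans:
        key = (label, text[start:end])
        if key not in mapping:
            counters[label] += 1
            mapping[key] = f"<{label}_{counters[label]}>"
        pieces.append(text[cursor:start])
        pieces.append(mapping[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def f1_score(predicted, gold):
    """Exact-match, span-level F1 between predicted and gold spans."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = "Write to ana@example.com or call +34 600 123 456. Again: ana@example.com"
    print(anonymize(sample, find_entities(sample)))
```

Mapping identical strings to identical placeholders is one simple way to keep references consistent across a message. Whether this corresponds to the anonymization metrics reported in the paper cannot be determined from the abstract alone.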