[2602.13455] Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety
Summary
This article explores the use of machine learning to detect obfuscated abusive language in Swahili, focusing on child safety and on the challenges posed by the language's limited linguistic resources and technological support.
Why It Matters
As digital platforms grow, so does the risk of online abuse, particularly towards children. This research addresses the urgent need for effective detection methods in low-resource languages like Swahili, contributing to safer online environments.
Key Takeaways
- Machine learning models such as Support Vector Machines (SVM), Logistic Regression, and Decision Trees can enhance detection of obfuscated abusive language.
- Swahili presents unique challenges because it is a low-resource language with limited linguistic resources and technological support.
- Data imbalance limits the generalizability of findings, highlighting the need for larger datasets.
- Precision, recall, and F1 scores are crucial metrics for evaluating model performance.
- Future work should focus on data robustness and the integration of multimodal data.
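To make the metrics takeaway concrete, here is a minimal pure-Python sketch of precision, recall, and F1 for a binary abusive/non-abusive classifier. The labels are made-up toy values, not data from the paper, and the function names are hypothetical. The example shows why accuracy alone can mislead on imbalanced data: the toy classifier scores 90% accuracy while missing half of the abusive examples.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, invented labels: 1 = abusive, 0 = not abusive (8 negatives, 2 positives).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On these toy labels, one of the two abusive examples is missed, so recall is 0.5 even though accuracy is 0.9. In a child-safety setting, a missed abusive message is a false negative, which is why recall and F1 are reported alongside precision.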
Computer Science > Computation and Language
arXiv:2602.13455 (cs)
[Submitted on 13 Feb 2026]
Authors: Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile
Abstract: The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset's ...
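The abstract names SMOTE as the technique used to handle data imbalance. The core idea can be sketched in a few lines of pure Python: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a synthetic point somewhere on the line segment between them. This is a simplified illustration, not the authors' code and not the imbalanced-learn implementation. The function name `smote_sketch`, the parameter `k`, and the toy 2-D points are all assumptions made for the example; a real text pipeline would apply the same interpolation to feature vectors derived from the text.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, invented minority-class feature vectors (hypothetical values).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=5)
print(len(new_points), "synthetic minority samples created")
```

Because each synthetic point is an interpolation between two real minority samples, oversampling this way should be applied only to the training split, so that synthetic points derived from test examples never leak into evaluation.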