[2602.17051] Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Summary
This article evaluates four cross-lingual classification approaches for filtering relevant content from multilingual social media data, using hydrogen energy discourse in English, Japanese, Hindi, and Korean as a case study for topic discovery.
Why It Matters
Public debates on social media often span many languages, which makes them hard to analyze consistently. This study compares classification techniques for filtering multilingual data, with the aim of supporting reliable analysis of global conversations.
Key Takeaways
- The study assesses four cross-lingual classification approaches for filtering relevant social media content.
- A decade-long dataset (2013–2022) of over nine million tweets in English, Japanese, Hindi, and Korean was analyzed to extract dominant themes.
- Key trade-offs between translation and multilingual approaches were identified.
- The findings offer actionable insights for optimizing cross-lingual pipelines.
- Keyword-driven data collection yields a significant amount of irrelevant content, which the classifiers are used to filter out.
Computer Science > Computation and Language
arXiv:2602.17051 (cs)
[Submitted on 19 Feb 2026]
Title: Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Authors: Deepak Uniyal, Md Abul Bashar, Richi Nayak
Abstract: Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013–2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid...
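The first three filtering approaches listed in the abstract differ mainly in where translation happens and how many models are involved. The sketch below illustrates that routing only; it is not the authors' implementation. All names (`filter_with_language_models`, `filter_via_english`, `filter_zero_shot`, `toy_translate`, `toy_english_model`) are hypothetical, and the translator and classifier are toy stand-ins for a machine translation system and a fine-tuned transformer. The fourth, hybrid approach is truncated in the source and is therefore not sketched.

```python
from typing import Callable, Dict, List, Tuple

Post = Tuple[str, str]                       # (language code, text)
Translator = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> text
Classifier = Callable[[str], bool]           # text -> True if relevant


def filter_with_language_models(posts: List[Post],
                                models: Dict[str, Classifier]) -> List[Post]:
    """Approach 1: one model per language, each assumed to be trained on
    English annotations translated into that language."""
    return [(lang, text) for lang, text in posts if models[lang](text)]


def filter_via_english(posts: List[Post], translate: Translator,
                       english_model: Classifier) -> List[Post]:
    """Approach 2: translate every post into English, then apply a single
    model built from English annotations."""
    kept = []
    for lang, text in posts:
        english_text = text if lang == "en" else translate(text, lang, "en")
        if english_model(english_text):
            kept.append((lang, text))  # keep the original-language post
    return kept


def filter_zero_shot(posts: List[Post],
                     multilingual_model: Classifier) -> List[Post]:
    """Approach 3: apply an English fine-tuned multilingual model directly
    to each language, with no translation step."""
    return [(lang, text) for lang, text in posts if multilingual_model(text)]


# Toy stand-ins (assumptions for illustration, not real systems or data).
TOY_TABLE = {"水素エネルギー": "hydrogen energy", "수소 폭탄": "hydrogen bomb"}


def toy_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    # Looks up a tiny fixed table; unknown text is returned unchanged.
    return TOY_TABLE.get(text, text)


def toy_english_model(text: str) -> bool:
    # Treats a post as relevant if it mentions "energy".
    return "energy" in text.lower()


demo_posts: List[Post] = [
    ("en", "Hydrogen energy policy update"),
    ("ja", "水素エネルギー"),
    ("ko", "수소 폭탄"),
]
kept = filter_via_english(demo_posts, toy_translate, toy_english_model)
```

In this toy run, all three posts match a "hydrogen" keyword, but the Korean post concerns a different topic, which is the kind of irrelevant content the filtering step is meant to remove.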