[2602.21543] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Summary
This paper presents a method for enhancing multilingual embeddings through multi-way parallel text alignment, demonstrating improved cross-lingual representation for natural language understanding tasks.
Why It Matters
The research addresses the challenge of cross-lingual alignment in multilingual pretraining, which is crucial for building robust natural language processing models that handle multiple languages effectively. The findings are relevant to multilingual applications and can improve performance across a range of NLU tasks.
Key Takeaways
- Utilizing a multi-way parallel corpus enhances cross-lingual alignment.
- Contrastive learning with diverse languages yields significant performance gains.
- Improvements were noted across various tasks, including bitext mining and classification.
Paper Details
Computer Science > Computation and Language, arXiv:2602.21543 (cs). Submitted on 25 Feb 2026.
Title: Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Authors: Barah Fazili, Koustava Goswami
Abstract: Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus spanning a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset by translating English text with an off-the-shelf NMT model into a pool of six target languages, and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages on multiple tasks from the MTEB benchmark, evaluated with XLM-RoBERTa and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, finetuning the mE5 model on a small datas...
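The contrastive alignment step described in the abstract can be illustrated with a standard in-batch InfoNCE objective over paired sentence embeddings, where each English sentence is pulled toward its translation and pushed away from the other translations in the batch. The function name, the temperature value, and the NumPy formulation below are illustrative assumptions for a minimal sketch, not the paper's actual implementation; in the multi-way setting, the positive for each anchor would be drawn from translations in any of the six target languages.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE loss with in-batch negatives (illustrative sketch).

    anchors:   (B, D) embeddings of source sentences (e.g. English).
    positives: (B, D) embeddings of their translations; row i of
               `positives` is the translation of row i of `anchors`.
               All other rows in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # (B, B) similarity matrix; diagonal entries are the true pairs.
    logits = (a @ p.T) / temperature

    # Numerically stable log-softmax over each row.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1,
                                                            keepdims=True)))

    # Cross-entropy against the diagonal (each anchor's own translation).
    return float(-np.mean(np.diag(log_probs)))
```

As a sanity check, a batch whose translations are embedded identically to their anchors yields a lower loss than one paired with unrelated embeddings, which is the alignment pressure the paper exploits.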