[2602.22207] Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Summary
The paper presents an automated framework for translating benchmarks and datasets for multilingual Large Language Model evaluation, addressing issues of semantic drift and context loss.
Why It Matters
As AI systems increasingly rely on multilingual datasets, ensuring the quality of translations is crucial for accurate model assessments. This framework enhances the reliability of evaluations, promoting robust AI development across diverse languages.
Key Takeaways
- Introduces a fully automated translation framework for benchmarks.
- Addresses challenges of semantic drift and context loss in translations.
- Introduces T-RANK, a multi-round ranking method that improves translation quality.
- Shows that the resulting translations outperform existing translated resources.
- Releases the framework and translated benchmarks to support reproducible multilingual AI development.
Computer Science > Computation and Language
arXiv:2602.22207 (cs)
[Submitted on 25 Feb 2026]
Title: Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Authors: Hanna Yukhymenko, Anton Alexandrov, Martin Vechev
Abstract: The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both referenc...
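The multi-round ranking idea can be pictured as a tournament over candidate translations: generate several drafts, have a judge rank them, keep the top fraction, and repeat until one survives. The sketch below is a minimal illustration of that shape only; the paper's actual T-RANK procedure is not specified here, and `judge_rank` is a hypothetical placeholder for an LLM-based judge.

```python
# Hypothetical sketch of multi-round ranking over candidate translations,
# in the spirit of T-RANK (the real method's details are not given in this
# summary; everything below is an illustrative assumption).

from typing import Callable, List


def judge_rank(candidates: List[str]) -> List[str]:
    """Stub ranker that orders candidates best-first.

    In practice this would prompt an LLM judge to compare translations;
    here we rank by length purely as a placeholder criterion.
    """
    return sorted(candidates, key=len, reverse=True)


def multi_round_rank(candidates: List[str],
                     ranker: Callable[[List[str]], List[str]],
                     keep_fraction: float = 0.5) -> str:
    """Repeatedly rank the pool and keep the top fraction until one remains."""
    pool = list(candidates)
    while len(pool) > 1:
        ranked = ranker(pool)
        keep = max(1, int(len(ranked) * keep_fraction))
        pool = ranked[:keep]  # survivors advance to the next round
    return pool[0]


if __name__ == "__main__":
    drafts = ["draft A", "a somewhat longer draft B",
              "the most elaborate candidate draft C", "D"]
    best = multi_round_rank(drafts, judge_rank)
    print(best)  # the placeholder ranker favors the longest candidate
```

Running several cheap ranking rounds instead of one all-pairs comparison is a common way to spend extra test-time compute while keeping the number of judge calls manageable.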