[2602.14073] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Summary
This article presents a methodology for adapting vision-language models to the Polish language using the LLaVA framework, demonstrating a +9.5% improvement on a Polish-adapted MMBench with minimal manual intervention.
Why It Matters
The adaptation of vision-language models to non-English languages is crucial for inclusivity in AI technology. This research addresses the limitations of existing models by providing a robust approach for Polish, which can serve as a model for other low-resource languages, enhancing accessibility and usability in diverse cultural contexts.
Key Takeaways
- The study demonstrates a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, achieved largely through automated translation.
- The approach relies on minimal manual intervention, showcasing efficiency in model adaptation.
- Public availability of models and datasets can facilitate further research in low-resource language processing.
- Challenges remain in cultural coverage and evaluation, indicating areas for future work.
- This research highlights the importance of multilingual capabilities in AI systems.
Computer Science > Computation and Language
arXiv:2602.14073 (cs)
[Submitted on 15 Feb 2026]
Authors: Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek, Katarzyna Bogusz, Sebastian Cygert, Wojciech Kusa
Abstract: Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators i...
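The abstract describes a fully automated pipeline for translating and filtering existing multimodal datasets. A minimal sketch of such a step is shown below; note that `translate_to_polish` is a hypothetical placeholder for whatever machine-translation system is used, and the length-ratio filter is a common sanity check for MT output, not necessarily the filtering the authors applied.

```python
# Illustrative translate-and-filter step for multimodal instruction data.
# Assumptions (not from the paper): `translate_to_polish` stands in for a
# real MT model or API; filtering uses a simple length-ratio heuristic.

def translate_to_polish(text: str) -> str:
    """Hypothetical placeholder for an automatic MT system."""
    return text  # identity stub for demonstration only

def keep_translation(src: str, tgt: str, lo: float = 0.5, hi: float = 2.0) -> bool:
    """Reject empty outputs and translations whose length ratio to the
    source falls outside a plausible band."""
    if not src.strip() or not tgt.strip():
        return False
    ratio = len(tgt) / len(src)
    return lo <= ratio <= hi

def build_polish_dataset(samples: list[dict]) -> list[dict]:
    """Translate the text fields of each (image, instruction, answer)
    sample and keep only samples where both translations pass the filter."""
    out = []
    for s in samples:
        pl_instr = translate_to_polish(s["instruction"])
        pl_answer = translate_to_polish(s["answer"])
        if keep_translation(s["instruction"], pl_instr) and keep_translation(
            s["answer"], pl_answer
        ):
            out.append(
                {"image": s["image"], "instruction": pl_instr, "answer": pl_answer}
            )
    return out

samples = [
    {"image": "img_001.jpg", "instruction": "Describe the image.",
     "answer": "A cat on a sofa."},
    {"image": "img_002.jpg", "instruction": "What color is the car?",
     "answer": ""},  # empty answer -> dropped by the filter
]
print(len(build_polish_dataset(samples)))  # → 1
```

In practice the filter stage would combine several signals (language identification, length ratios, repetition checks) before the translated data is used for LLaVA-style visual instruction tuning.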