[2602.21379] MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

arXiv - Machine Learning 3 min read Article

Summary

MrBERT introduces a family of multilingual encoders adapted to specific languages and specialized domains, achieving state-of-the-art results on targeted tasks while reducing storage and inference costs.

Why It Matters

As multilingual models become increasingly important in AI applications, MrBERT's advancements in vocabulary and domain adaptation can significantly improve performance in specialized areas like biomedical and legal fields. This research bridges the gap between theoretical models and practical applications, making it relevant for developers and researchers in NLP.

Key Takeaways

  • MrBERT utilizes a modern architecture with 150M-300M parameters, pre-trained on 35 languages.
  • The model excels in Catalan and Spanish tasks and shows robust performance in specialized domains.
  • Incorporates Matryoshka Representation Learning for flexible vector sizing, reducing inference and storage costs.
  • Open-sourced on Hugging Face, promoting accessibility for developers and researchers.
  • Demonstrates the potential of modern encoders for both linguistic excellence and domain specialization.

Computer Science > Computation and Language
arXiv:2602.21379 (cs) [Submitted on 24 Feb 2026]

Title: MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Authors: Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Marta Villegas

Abstract: We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open source the complete model family on Huggingface.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Lea...
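The paper itself includes no code, but the MRL idea it relies on has a simple downstream recipe: a Matryoshka-trained encoder packs the most useful information into the leading coordinates of its embedding, so a consumer can keep only the first k dimensions and re-normalize. The sketch below illustrates that truncation step on a randomly generated stand-in vector (no actual MrBERT model is loaded; the dimensions 768 and 256 are illustrative assumptions, not values from the paper):

```python
import math
import random

def truncate_embedding(vec, dim):
    """MRL-style downsizing: keep the first `dim` coordinates, then
    re-normalize so cosine similarity still works on the smaller vector."""
    sub = vec[:dim]
    norm = math.sqrt(sum(x * x for x in sub))
    return [x / norm for x in sub] if norm > 0 else list(sub)

def cosine(a, b):
    # Assumes both inputs are already unit-normalized.
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
# Stand-in for a full-size encoder output (a real model would produce this).
full = [random.gauss(0.0, 1.0) for _ in range(768)]

# Store/search with a quarter of the dimensions at a fraction of the cost.
small = truncate_embedding(full, 256)
print(len(small))  # 256
```

Because the truncated vector is re-normalized, it drops into any cosine-similarity retrieval pipeline unchanged; the trade-off is a modest accuracy loss in exchange for roughly proportional savings in index storage and inference-time dot products.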

Related Articles

PSA: Anyone with a link can view your Granola notes by default | The Verge

Granola, the AI-powered note-taking app, makes your notes viewable by anyone with a link by default. It also turns on AI training for any...

The Verge - AI · 5 min ·
Machine Learning

[D] On-Device Real-Time Visibility Restoration: Deterministic CV vs. Quantized ML Models. Looking for insights on Edge Preservation vs. Latency.

Hey everyone, We have been working on a real-time camera engine for iOS that currently uses a purely deterministic Computer Vision approa...

Reddit - Machine Learning · 1 min ·
LLMs

[R] Is autoresearch really better than classic hyperparameter tuning?

We did experiments comparing Optuna & autoresearch. Autoresearch converges faster, is more cost-efficient, and even generalizes bette...

Reddit - Machine Learning · 1 min ·
LLMs

[R] Solving the Jane Street Dormant LLM Challenge: A Systematic Approach to Backdoor Discovery

Submitted by: Adam Kruger Date: March 23, 2026 Models Solved: 3/3 (M1, M2, M3) + Warmup Background When we first encountered the Jane Str...

Reddit - Machine Learning · 1 min ·