[2604.04790] HUKUKBERT: Domain-Specific Language Model for Turkish Law
Computer Science > Computation and Language

arXiv:2604.04790 (cs) [Submitted on 6 Apr 2026]

Title: HUKUKBERT: Domain-Specific Language Model for Turkish Law
Authors: Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug

Abstract: Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet studies specific to Turkish law remain limited due to the scarcity of domain-specific data and models. Although extensive models such as LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a high-volume, domain-specific counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology that integrates Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark, a masked legal term prediction task designed for Turkish court decisions, HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated Hu...
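To make the masking terminology concrete, below is a minimal sketch of Whole-Word Masking, one of the four strategies the abstract names, applied to WordPiece output. This is an illustration only, not the paper's pipeline: the function name, masking probability, and token examples are assumptions, and the convention that a `##` prefix marks a subword continuation follows standard WordPiece tokenization.

```python
import random

MASK = "[MASK]"  # standard BERT mask token


def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Illustrative Whole-Word Masking: if any WordPiece of a word is
    selected for masking, every piece of that word is masked together.
    A hypothetical sketch; the paper's actual DAPT setup may differ."""
    rng = random.Random(seed)

    # Group token indices into words: a "##"-prefixed piece
    # continues the previous word (WordPiece convention).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        # Decide per word, then mask all of its pieces at once.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK
    return masked


# Example with made-up Turkish legal subword tokens:
tokens = ["hukuk", "##i", "karar", "mahkeme", "##si", "temyiz"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

The key property, in contrast to plain token-level masking, is that the pieces of a single word ("mahkeme" + "##si") are always masked or left intact together, so the model must predict the full word rather than trivially completing it from a visible fragment.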