[2604.04790] HUKUKBERT: Domain-Specific Language Model for Turkish Law
Computer Science > Computation and Language

arXiv:2604.04790 (cs) [Submitted on 6 Apr 2026]

Title: HUKUKBERT: Domain-Specific Language Model for Turkish Law
Authors: Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug

Abstract: Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet studies specific to Turkish law remain limited due to the scarcity of domain-specific data and models. Although extensive models such as LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a high-volume, domain-specific counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology that integrates Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark, a masked legal term prediction task designed for Turkish court decisions, HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated Hu...
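To make the masking terminology concrete, below is a minimal sketch of Whole-Word Masking, one of the four strategies the abstract names, applied to WordPiece output. This is an illustration only, not the paper's pipeline: the function name, masking probability, and token examples are assumptions, and the convention that a `##` prefix marks a subword continuation follows standard WordPiece tokenization.

```python
import random

MASK = "[MASK]"  # standard BERT mask token


def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Illustrative Whole-Word Masking: if any WordPiece of a word is
    selected for masking, every piece of that word is masked together.
    A hypothetical sketch; the paper's actual DAPT setup may differ."""
    rng = random.Random(seed)

    # Group token indices into words: a "##"-prefixed piece
    # continues the previous word (WordPiece convention).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        # Decide per word, then mask all of its pieces at once.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK
    return masked


# Example with made-up Turkish legal subword tokens:
tokens = ["hukuk", "##i", "karar", "mahkeme", "##si", "temyiz"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

The key property, in contrast to plain token-level masking, is that the pieces of a single word ("mahkeme" + "##si") are always masked or left intact together, so the model must predict the full word rather than trivially completing it from a visible fragment.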