[2604.04790] HUKUKBERT: Domain-Specific Language Model for Turkish Law



Computer Science > Computation and Language
arXiv:2604.04790 (cs) [Submitted on 6 Apr 2026]

Title: HUKUKBERT: Domain-Specific Language Model for Turkish Law
Authors: Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug

Abstract: Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet studies specific to Turkish law remain limited by the scarcity of domain-specific data and models. Although extensive models such as LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a comparable high-volume, domain-specific counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology that integrates Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark -- a masked legal-term prediction task designed for Turkish court decisions -- HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated Hu...
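The abstract names Whole-Word Masking as one of the DAPT pre-training objectives. The idea is that WordPiece splits a word into a head piece plus `##`-prefixed continuation pieces, and masking decisions are made per word rather than per token, so a model cannot trivially reconstruct a masked piece from its unmasked neighbors. The paper does not publish its masking code; the sketch below is a minimal, hypothetical illustration of the technique (the Turkish example tokens and the `whole_word_mask` helper are our own, not from the paper):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words in a WordPiece token sequence.

    A word is a head token plus any following '##' continuation
    pieces; every piece of a selected word is masked together.
    Returns the masked sequence and per-position MLM labels
    (the original token at masked positions, None elsewhere).
    """
    rng = random.Random(seed)
    # Group token indices into words: a new word starts at any
    # token that does not begin with '##'.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]  # original token is the MLM target
                masked[i] = mask_token
    return masked, labels

# Hypothetical WordPiece split of a short Turkish legal phrase.
tokens = ["hukuk", "##i", "sorumluluk", "davasi", "ac", "##il", "##di"]
masked, labels = whole_word_mask(tokens, mask_prob=0.5, seed=1)
```

The paper's Token Span, Word Span, and Keyword Masking variants would change only the word-selection step (contiguous spans, or a curated legal-keyword list) while keeping the same mask-together invariant.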
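The reported 84.40% figure is Top-1 accuracy on the Legal Cloze Test: for each court-decision passage with a masked legal term, the model's highest-ranked candidate must exactly match the gold term. As a minimal sketch of that metric (the scorer and the Turkish candidate terms below are illustrative assumptions, not the paper's evaluation code):

```python
def top1_accuracy(predictions, gold):
    """Top-1 accuracy for a masked-term cloze benchmark.

    predictions: one ranked candidate list per item, best candidate first
    gold: the gold legal term for each item
    """
    correct = sum(
        1
        for ranked, answer in zip(predictions, gold)
        if ranked and ranked[0] == answer
    )
    return correct / len(gold)

# Hypothetical model outputs for three cloze items.
preds = [["tazminat", "faiz"], ["temyiz", "istinaf"], ["faiz", "tazminat"]]
gold = ["tazminat", "istinaf", "faiz"]
acc = top1_accuracy(preds, gold)  # 2 of 3 top-ranked candidates match
```

A Top-k variant would check membership in `ranked[:k]` instead of equality with `ranked[0]`.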

Originally published on April 07, 2026. Curated by AI News.


