[2603.05314] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Computer Science > Computation and Language

arXiv:2603.05314 (cs)

[Submitted on 5 Mar 2026]

Title: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for...
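To make the "token-level sequence labeling" framing and the macro-averaged F1 metric concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the label names, the set of punctuation marks, and the helper names `make_labels` and `macro_f1` are all assumptions for illustration, since the abstract does not list the label inventory or say which classes enter the macro average. The sample sentence is transliterated Persian, with the Persian comma and question mark written as Unicode escapes.

```python
# Hypothetical label inventory (an assumption; the abstract does not specify it).
# Keys: Persian comma (U+060C), full stop, Persian question mark (U+061F), exclamation mark.
PUNCT_TO_LABEL = {
    "\u060c": "COMMA",
    ".": "PERIOD",
    "\u061f": "QUESTION",
    "!": "EXCLAMATION",
}


def make_labels(text):
    """Turn punctuated text into (tokens, labels) training pairs.

    Each label names the punctuation mark that follows the token, or "O"
    when nothing follows. If several marks trail a word, the one nearest
    the word wins.
    """
    tokens, labels = [], []
    for raw in text.split():
        word, label = raw, "O"
        while word and word[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
        elif tokens and label != "O":
            # A free-standing mark is attached to the previous token.
            labels[-1] = label
    return tokens, labels


def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    scores = []
    for c in classes:
        pairs = list(zip(gold, pred))
        tp = sum(g == c and p == c for g, p in pairs)
        fp = sum(g != c and p == c for g, p in pairs)
        fn = sum(g == c and p != c for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    tokens, gold = make_labels("salam\u060c hale shoma chetor ast\u061f")
    pred = ["COMMA", "O", "O", "O", "O"]  # a made-up prediction that misses the question mark
    print(list(zip(tokens, gold)))
    print(macro_f1(gold, pred, ["O", "COMMA", "QUESTION"]))
```

Because macro averaging weights every class equally, a single missed question mark pulls the score down far more than it would under token-level accuracy, which is dominated by the "O" class. In a fine-tuning setup such as the one the abstract describes, word-level labels like these would additionally need to be aligned to the model's subword tokens.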

Originally published on March 06, 2026. Curated by AI News.
