[2603.05314] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

arXiv - AI March 06, 2026 3 min read

About this article

Abstract page for arXiv paper 2603.05314: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Computer Science > Computation and Language

arXiv:2603.05314 (cs) [Submitted on 5 Mar 2026]

Title: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for...
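To make the "token-level sequence labeling" framing concrete, here is a minimal Python sketch of how punctuated text might be turned into per-word labels, how predicted labels map back to text, and how a macro-averaged F1 could be computed. It is not the authors' pipeline. The label set, the helper names (`to_labels`, `restore`, `macro_f1`), and the choice to average only over punctuation classes are all assumptions for illustration; the abstract does not specify the paper's punctuation inventory or its exact evaluation protocol.

```python
# Hypothetical label inventory (assumption): the abstract does not list
# which punctuation marks the paper actually predicts.
PUNCT_LABELS = {"،": "COMMA", ".": "PERIOD", "؟": "QUESTION", "!": "EXCLAMATION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_LABELS.items()}


def to_labels(text):
    """Split punctuated text into words and per-word labels.

    Each word is labeled with the punctuation mark that follows it,
    or "O" if none does. Simplification: only the last trailing
    character of each whitespace-separated token is inspected.
    """
    words, labels = [], []
    for raw in text.split():
        mark = raw[-1] if raw[-1] in PUNCT_LABELS else None
        word = raw[:-1] if mark else raw
        if not word:
            # A stand-alone punctuation token: attach it to the previous word.
            if labels:
                labels[-1] = PUNCT_LABELS[mark]
            continue
        words.append(word)
        labels.append(PUNCT_LABELS[mark] if mark else "O")
    return words, labels


def restore(words, labels):
    """Re-insert punctuation after each word according to its label."""
    return " ".join(w + LABEL_TO_PUNCT.get(l, "") for w, l in zip(words, labels))


def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    text = "سلام، حال شما چطور است؟"
    words, gold = to_labels(text)
    print(list(zip(words, gold)))
    print(restore(words, gold) == text)
```

Because the model only ever chooses a label per input word, it cannot rewrite, drop, or add words. That property is what separates this formulation from the over-correction behavior the abstract attributes to large language models. In a real system the labels would be predicted per subword token by a fine-tuned encoder such as ParsBERT and then aligned back to words; that alignment step is omitted here.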

Originally published on March 06, 2026. Curated by AI News.
