[2603.05314] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Computer Science > Computation and Language
arXiv:2603.05314 (cs)
[Submitted on 5 Mar 2026]
Title: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for...