[2602.18089] DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text
Summary
DohaScript introduces a large-scale dataset for continuous handwritten Hindi text, addressing the lack of diverse and high-quality resources for handwriting analysis in Devanagari script.
Why It Matters
This dataset supports research in handwriting recognition and analysis for Devanagari, a script that remains underrepresented in public benchmark datasets even though Hindi has hundreds of millions of speakers. Because every writer transcribes the same text, it offers a controlled basis for comparing machine learning models in handwriting recognition and style analysis.
Key Takeaways
- DohaScript is a large-scale dataset featuring continuous handwritten Hindi text from 531 contributors.
- All writers transcribe the same fixed set of six traditional Hindi dohas (couplets), so writer-specific variation can be analyzed independently of linguistic content.
- It supports various applications, including handwriting recognition and style analysis.
- Rigorous quality curation ensures high reliability and practical value for researchers.
- DohaScript aims to fill the gap in resources for Devanagari handwriting analysis.
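Since all 531 writers transcribe the same six dohas, one natural way to use such a parallel corpus is to hold out whole writers for evaluation, so a model is tested on handwriting styles it has never seen. The following is a minimal sketch of that idea, not the authors' protocol. The index layout, the function names, and the assumption of exactly one sample per writer per doha are all hypothetical; only the counts 531 and 6 come from the paper.

```python
import random
from collections import defaultdict

NUM_WRITERS = 531  # number of contributors reported in the paper
NUM_DOHAS = 6      # number of dohas every writer transcribes


def build_index(num_writers=NUM_WRITERS, num_dohas=NUM_DOHAS):
    """Hypothetical sample index: one (writer_id, doha_id) pair per sample.

    Assumes exactly one sample per writer per doha, which the source
    does not state explicitly.
    """
    return [(w, d) for w in range(num_writers) for d in range(num_dohas)]


def writer_disjoint_split(index, test_fraction=0.2, seed=0):
    """Split samples so that no writer appears in both train and test."""
    writers = sorted({w for w, _ in index})
    rng = random.Random(seed)
    rng.shuffle(writers)
    n_test = int(len(writers) * test_fraction)
    test_writers = set(writers[:n_test])
    train = [s for s in index if s[0] not in test_writers]
    test = [s for s in index if s[0] in test_writers]
    return train, test


def group_by_doha(index):
    """Group writer ids by doha: same text, different hands."""
    groups = defaultdict(list)
    for writer_id, doha_id in index:
        groups[doha_id].append(writer_id)
    return dict(groups)


index = build_index()
train, test = writer_disjoint_split(index)
by_doha = group_by_doha(index)
```

Grouping by doha gives sets of samples with identical text and different writers, which is the comparison the parallel design is meant to enable.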
arXiv:2602.18089 (cs), Computer Vision and Pattern Recognition. Submitted on 20 Feb 2026.
Authors: Kunwar Arpit Singh, Ankush Prakash, Haroon R Lone
Abstract: Despite having hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer-level diversity, which restricts their utility for modern data-driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer-specific variation independent of linguistic content, and supports tasks such as h...