[2602.13352] Using Deep Learning to Generate Semantically Correct Hindi Captions
Summary
This article summarizes research that applies deep learning to generate semantically accurate image captions in Hindi, combining pre-trained CNN encoders with attention-based LSTM decoders.
Why It Matters
Image captioning research has focused overwhelmingly on English; this study addresses that gap by generating captions in Hindi, the fourth most spoken language globally. The advance can make image content more accessible and usable for Hindi speakers, promoting inclusivity in AI applications.
Key Takeaways
- The research employs multi-modal architectures for image captioning in Hindi.
- Attention-based bidirectional LSTM with VGG16 achieved the best BLEU scores.
- The study highlights the potential for further research in multilingual image captioning.
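The takeaways above cite BLEU scores as the evaluation metric. As context, here is a minimal pure-Python sketch of BLEU-1 (clipped unigram precision with a brevity penalty); the paper's actual evaluation setup and any smoothing choices are not specified here, so treat this only as an illustration of the metric.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Each candidate word is credited at most as often as it appears in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# A perfect match scores 1.0; repeated words are clipped.
print(bleu1("a man rides a horse", "a man rides a horse"))  # → 1.0
```

Higher-order BLEU-n scores extend this with clipped n-gram precisions combined by a geometric mean.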
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13352 (cs) [Submitted on 13 Feb 2026]
Title: Using Deep Learning to Generate Semantically Correct Hindi Captions
Authors: Wasim Akram Khan, Anil Kumar Vuppala
Abstract: Automated image captioning from image content is very appealing when done by harnessing the capabilities of computer vision and natural language processing. Extensive research in the field has focused largely on English, which leaves scope for further development in other widely spoken languages. This research applies distinct models to generate image captions in Hindi, the fourth most popular language in the world. Exploring multi-modal architectures, it combines local visual features, global visual features, attention mechanisms, and pre-trained models. Hindi image descriptions were generated by applying Google Cloud Translator to the Flickr8k image dataset. Pre-trained CNNs such as VGG16, ResNet50, and Inception V3 were used to retrieve image characteristics, while uni-directional and bi-directional techniques were used for text encoding. An additional attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each ...
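The abstract's last sentence describes an attention layer that produces a weight vector and multiplies it with the per-region image features. A minimal NumPy sketch of that soft-attention step is below; the layer sizes, the additive scoring function, and all parameter names are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def soft_attention(features, hidden, Wf, Wh, v):
    """Compute an attention weight vector over local CNN features and
    combine the features into a single context vector by weighted sum."""
    # features: (L, D) local region features; hidden: (H,) decoder state.
    scores = np.tanh(features @ Wf + hidden @ Wh) @ v   # (L,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax -> weight vector
    context = weights @ features                        # (D,) attended features
    return context, weights

# Hypothetical sizes: a 7x7 VGG16 feature grid (49 regions of 512 dims).
rng = np.random.default_rng(0)
L, D, H, A = 49, 512, 256, 128
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
Wf = rng.standard_normal((D, A)) * 0.01
Wh = rng.standard_normal((H, A)) * 0.01
v = rng.standard_normal(A)
context, weights = soft_attention(features, hidden, Wf, Wh, v)
```

At each decoding step the context vector would be fed, together with the previous word embedding, into the (bi)LSTM that emits the next Hindi word.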