[2602.13352] Using Deep Learning to Generate Semantically Correct Hindi Captions

arXiv - AI · 4 min read

Summary

This article explores the use of deep learning techniques to generate semantically accurate image captions in Hindi, utilizing advanced models and attention mechanisms.

Why It Matters

With a significant focus on English in image captioning research, this study addresses a gap by generating captions in Hindi, the fourth most spoken language globally. This advancement can enhance accessibility and usability of image content for Hindi speakers, promoting inclusivity in AI applications.

Key Takeaways

  • The research employs multi-modal architectures for image captioning in Hindi.
  • Attention-based bidirectional LSTM with VGG16 achieved the best BLEU scores.
  • The study highlights the potential for further research in multilingual image captioning.
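The takeaways above report model quality in BLEU scores. As a rough illustration of what that metric measures, here is a minimal unigram-BLEU sketch (clipped precision with a brevity penalty); the real evaluation would use higher-order n-grams and multiple references, so this is a simplification, not the paper's evaluation code.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Unigram BLEU: clipped unigram precision times a brevity
    penalty. An illustrative simplification of the metric."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty discourages overly short captions
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

A caption identical to its reference scores 1.0; repeating one reference word scores well below that, which is the clipping at work.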

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13352 (cs) · Submitted on 13 Feb 2026

Title: Using Deep Learning to Generate Semantically Correct Hindi Captions
Authors: Wasim Akram Khan, Anil Kumar Vuppala

Abstract: Automated image captioning from image content is very appealing when done by harnessing the capabilities of computer vision and natural language processing. Extensive research in the field has focused largely on English, leaving scope for further development in other widely spoken languages. This research utilizes distinct models for generating image captions in Hindi, the fourth most spoken language in the world. Exploring multi-modal architectures, it combines local visual features, global visual features, attention mechanisms, and pre-trained models. Hindi image descriptions were generated by applying Google Cloud Translator to the captions of the Flickr8k image dataset. Pre-trained CNNs such as VGG16, ResNet50, and Inception V3 were used to extract image characteristics, while uni-directional and bi-directional techniques were used for text encoding. An additional attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each ...
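The abstract's attention step, scoring each image region against the decoder state, softmaxing the scores into a weight vector, and multiplying to combine region features into a context vector, can be sketched as below. This is a generic additive-style attention in pure Python with a toy scoring function, not the paper's implementation; `region_feats`, `hidden`, and `w` are hypothetical names.

```python
import math

def attend(region_feats, hidden, w):
    """Sketch of attention over CNN region features.
    region_feats: list of region feature vectors (one per image region)
    hidden:       decoder (LSTM) hidden-state vector, same length
    w:            learned scoring weights (toy stand-in here)
    Returns (weights, context): softmax weights over regions and the
    weighted sum of region features fed to the decoder each step."""
    # Toy score: weighted sum over the elementwise sum of region and state
    scores = [sum(wd * (f + h) for wd, f, h in zip(w, feat, hidden))
              for feat in region_feats]
    # Numerically stable softmax turns scores into a weight vector
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: regions combined by multiplying with their weights
    dim = len(region_feats[0])
    context = [sum(wt * feat[d] for wt, feat in zip(weights, region_feats))
               for d in range(dim)]
    return weights, context
```

The weights always sum to 1, so the context vector is a convex combination of the region features; in the paper's setup, those features would come from VGG16/ResNet50/Inception V3 and the hidden state from the (bi-directional) LSTM decoder.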

