[2509.21609] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Summary
The paper presents VLCE (Vision Language Caption Enhancer), a framework that enriches image captioning for disaster assessment with external semantic knowledge, turning raw imagery into disaster-specific, actionable descriptions.
Why It Matters
This research addresses a key limitation of current vision language models (VLMs) in disaster scenarios: their lack of domain knowledge. By augmenting captioning with domain-specific semantic knowledge, the method could significantly improve real-time disaster response, making it highly relevant for emergency management and for AI applications in crisis situations.
Key Takeaways
- VLCE integrates external knowledge sources like ConceptNet and WordNet to enhance image captioning in disaster assessments.
- The framework pairs a CNN-LSTM model (ResNet50 backbone) with satellite imagery and a Vision Transformer with UAV imagery.
- VLCE outperforms baseline models, achieving 95.33% accuracy on UAV imagery assessments.
- The research signifies a shift from basic visual classification to generating actionable intelligence for disaster management.
- Immediate applicability in real-time systems can improve disaster response efficiency.
arXiv:2509.21609 (cs) · Computer Science > Computer Vision and Pattern Recognition
Submitted on 25 Sep 2025 (v1); last revised 17 Feb 2026 (v5)
Authors: Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George
Abstract: Classification and segmentation with artificial intelligence play a vital role in automating disaster assessments. However, contemporary VLMs produce details that are poorly aligned with the objectives of disaster assessment, primarily due to their lack of domain knowledge and of a more refined descriptive process. This research presents the Vision Language Caption Enhancer (VLCE), a dedicated multimodal framework that integrates external semantic knowledge from ConceptNet and WordNet to improve the captioning process. The objective is to produce disaster-specific descriptions that convert raw visual data into actionable intelligence. VLCE utilizes two separate architectures: a CNN-LSTM model with a ResNet50 backbone, pretrained on EuroSAT for satellite imagery (xBD dataset), and a Vision Transformer developed for UAV ...
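The knowledge-enhancement step described in the abstract can be illustrated with a toy semantic-expansion pass. VLCE itself queries ConceptNet and WordNet; in this sketch a small hand-coded relation table stands in for those sources, and every term and relation in it is illustrative, not taken from the paper.

```python
# Toy stand-in for ConceptNet/WordNet lookups: maps disaster-related
# words to associated concepts. All entries are illustrative.
SEMANTIC_RELATIONS = {
    "flood": ["water damage", "inundation", "submerged roads"],
    "collapsed": ["structural failure", "debris", "building damage"],
    "fire": ["burn scar", "smoke", "charred structures"],
}

def enhance_caption(caption: str, max_terms: int = 3) -> str:
    """Append related disaster concepts for terms found in the caption."""
    related = []
    for word in caption.lower().split():
        related.extend(SEMANTIC_RELATIONS.get(word.strip(".,"), []))
    if not related:
        return caption
    # Deduplicate while preserving order, then cap the expansion.
    extras = list(dict.fromkeys(related))[:max_terms]
    return f"{caption} (related: {', '.join(extras)})"

print(enhance_caption("A flood covering collapsed homes."))
# → A flood covering collapsed homes. (related: water damage, inundation, submerged roads)
```

The real framework would score and filter retrieved concepts against the image content rather than appending them verbatim; this sketch only shows where external semantic knowledge enters the captioning pipeline.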