[2602.07680] Vision and Language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning
Summary
This paper investigates how vision-language models (VLMs) support driving scene safety assessment and decision-making when their representations are integrated into perception, prediction, and planning pipelines, studied through three complementary system-level use cases.
Why It Matters
As autonomous vehicles become more prevalent, ensuring their safety is critical. This research shows how vision-language models can enhance hazard detection and decision-making, particularly for rare or out-of-distribution hazards that conventional object detectors may miss, pointing toward safer driving systems.
Key Takeaways
- Vision-language models can improve safety assessment in autonomous driving.
- A lightweight, category-agnostic hazard screening approach can detect diverse and out-of-distribution road hazards without explicit object detection.
- Integrating scene-level embeddings into planning frameworks requires careful alignment with tasks.
- Natural language can serve as a behavioral constraint, enhancing safety in ambiguous scenarios.
- The findings emphasize the need for structured system design in implementing these models.
Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.07680 (cs) [Submitted on 7 Feb 2026 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: Vision and Language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Authors: Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi

Abstract: Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into planning frameworks…
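The first use case, hazard screening, scores each camera frame against natural-language hazard descriptions using CLIP image-text similarity, so no per-category detector is needed. Below is a minimal sketch of that idea using Hugging Face's CLIP implementation; the checkpoint name, prompt sets, and 0.5 threshold are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of category-agnostic hazard screening via CLIP image-text
# similarity. The checkpoint, prompts, and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Contrasting prompt sets: generic hazard descriptions vs. nominal driving.
hazard_prompts = [
    "a road with debris blocking the lane",
    "a photo of a dangerous driving scene",
    "an obstacle on the highway",
]
nominal_prompts = [
    "a photo of a clear, safe road",
    "an empty highway in good conditions",
]

def hazard_score(image: Image.Image) -> float:
    """Return a semantic hazard score in [0, 1] for one camera frame."""
    texts = hazard_prompts + nominal_prompts
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(0)  # (num_texts,)
    probs = logits.softmax(dim=-1)
    # Probability mass assigned to hazard prompts acts as the hazard signal.
    return probs[: len(hazard_prompts)].sum().item()

frame = Image.open("frame.jpg")
if hazard_score(frame) > 0.5:  # placeholder threshold
    print("semantic hazard flagged; escalate to downstream perception")
```

Because the score is just softmax mass over free-form hazard prompts, the screen stays category-agnostic and low-latency: new hazard types can be covered by editing the prompt list rather than retraining a detector.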
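The second use case, feeding scene-level vision-language embeddings into a planner, is only partially visible in the truncated abstract. As a hedged illustration of what such integration can look like, here is a toy planner head that conditions waypoint regression on a frozen scene embedding; the dimensions, ego-state features, horizon, and two-layer MLP are assumptions for illustration, not the paper's architecture.

```python
# A minimal sketch (not the paper's method) of fusing a frozen
# vision-language scene embedding with ego state in a small planner head.
import torch
import torch.nn as nn

class EmbeddingConditionedPlanner(nn.Module):
    def __init__(self, scene_dim: int = 512, ego_dim: int = 6, horizon: int = 10):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(scene_dim + ego_dim, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * 2),  # one (x, y) waypoint per future step
        )

    def forward(self, scene_emb: torch.Tensor, ego_state: torch.Tensor):
        # scene_emb: (B, scene_dim) frozen vision-language embedding
        # ego_state: (B, ego_dim), e.g. speed, yaw rate, past displacement
        x = torch.cat([scene_emb, ego_state], dim=-1)
        return self.mlp(x).view(-1, self.horizon, 2)

planner = EmbeddingConditionedPlanner()
waypoints = planner(torch.randn(1, 512), torch.randn(1, 6))
print(waypoints.shape)  # torch.Size([1, 10, 2])
```

A sketch like this also illustrates the alignment caveat from the key takeaways: a contrastively trained scene embedding is optimized for image-text matching, not geometric planning, so whether it helps depends on how well its semantics align with the planning task.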