Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2511.21428] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Machine Learning

arXiv - AI · 4 min ·
[2511.16719] SAM 3: Segment Anything with Concepts
Machine Learning

arXiv - AI · 4 min ·
[2603.28594] Detection of Adversarial Attacks in Robotic Perception
Machine Learning

arXiv - AI · 3 min ·

All Content

[2602.13329] HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving
Machine Learning

The HiST-VLA model enhances autonomous driving by integrating vision, language, and action through improved spatio-temporal reasoning and...

arXiv - AI · 3 min ·
[2602.13324] Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge
LLMs

This paper presents a zero-shot framework for target verification and tactical reasoning in autonomous edge robotics, addressing challeng...

arXiv - AI · 4 min ·
[2602.13315] IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
LLMs

The paper presents IDPruner, a novel method for visual token pruning in Multimodal Large Language Models (MLLMs), balancing importance an...

arXiv - AI · 4 min ·
[2602.13314] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction
Machine Learning

The paper presents Sim2Radar, a framework that generates synthetic radar data from RGB images, addressing the challenges of limited radar...

arXiv - AI · 3 min ·
[2602.13313] Agentic Spatio-Temporal Grounding via Collaborative Reasoning
AI Agents

The paper presents the Agentic Spatio-Temporal Grounder (ASTG), a novel framework for Spatio-Temporal Video Grounding (STVG) that enhance...

arXiv - AI · 3 min ·
[2602.13310] Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
LLMs

The paper introduces Visual Para-Thinker, a novel framework for parallel reasoning in visual comprehension, addressing limitations in exi...

arXiv - AI · 3 min ·
[2602.13308] Learning to Select Like Humans: Explainable Active Learning for Medical Imaging
Machine Learning

This paper presents an explainable active learning framework for medical imaging that enhances data efficiency and interpretability by in...

arXiv - AI · 4 min ·
[2602.13306] Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique
LLMs

This paper presents a framework for automating the scoring and critique of artwork using a fine-tuned vision-language model, achieving hi...

arXiv - Machine Learning · 4 min ·
[2602.13305] WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery
Computer Vision

WildfireVLM introduces an AI framework for early wildfire detection and risk assessment using satellite imagery, enhancing disaster manag...

arXiv - AI · 4 min ·
[2602.13304] Progressive Contrast Registration for High-Fidelity Bidirectional Photoacoustic Microscopy Alignment
AI Safety

This article presents PCReg-Net, a novel framework for high-fidelity alignment in bidirectional photoacoustic microscopy, significantly i...

arXiv - AI · 3 min ·
[2602.13303] Spectral Collapse in Diffusion Inversion
Generative AI

The paper discusses 'spectral collapse' in diffusion inversion, highlighting failures in standard deterministic methods for image transla...

arXiv - Machine Learning · 3 min ·
[2602.13299] KidMesh: Computational Mesh Reconstruction for Pediatric Congenital Hydronephrosis Using Deep Neural Networks
Machine Learning

The paper presents KidMesh, a deep learning approach for reconstructing computational meshes for pediatric congenital hydronephrosis from...

arXiv - AI · 4 min ·
[2602.13298] Effect of Convolutional Depth on Image Recognition Performance: VGG vs. ResNet vs. GoogLeNet
Machine Learning

This paper examines how convolutional depth affects image recognition performance across three architectures: VGG, ResNet, and GoogLeNet,...

arXiv - AI · 3 min ·
[2602.13294] VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
LLMs

The paper introduces VisPhyWorld, a framework for evaluating physical reasoning in Multimodal Large Language Models (MLLMs) through code-...

arXiv - AI · 3 min ·
[2602.13289] Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs
LLMs

This paper evaluates the effects of Post-Training Quantization (PTQ) on the reliability and accuracy of Visual Question Answering (VQA) u...

arXiv - AI · 4 min ·
[2602.13286] Explanatory Interactive Machine Learning for Bias Mitigation in Visual Gender Classification
Machine Learning

This article explores Explanatory Interactive Machine Learning (XIL) as a method to mitigate bias in visual gender classification, demons...

arXiv - Machine Learning · 4 min ·
[2602.14318] In Transformer We Trust? A Perspective on Transformer Architecture Failure Modes
Machine Learning

The paper examines the trustworthiness of transformer architectures in high-stakes applications, analyzing their reliability, interpretab...

arXiv - Machine Learning · 4 min ·
[2602.14078] Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning
Machine Learning

This paper presents a novel approach, Adaptive Entropy Annealing (aEPG), to enhance continual fine-tuning of large pretrained vision mode...

arXiv - AI · 4 min ·
[2602.14225] Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding
Machine Learning

This paper explores the significance of staged knowledge injection in enhancing agentic reinforcement learning for ultra-high-resolution ...

arXiv - AI · 4 min ·
[2602.13710] HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models
Machine Learning

The paper presents HBVLA, a framework for 1-bit post-training quantization of Vision-Language-Action models, enhancing efficiency while m...

arXiv - Machine Learning · 4 min ·