Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2511.21428] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Machine Learning

arXiv - AI · 4 min ·
[2511.16719] SAM 3: Segment Anything with Concepts
Machine Learning

arXiv - AI · 4 min ·
[2603.28594] Detection of Adversarial Attacks in Robotic Perception
Machine Learning

arXiv - AI · 3 min ·

All Content

[2602.14941] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
Generative AI

AnchorWeave introduces a novel framework for video generation that enhances spatial consistency over long durations by utilizing multiple...

arXiv - AI · 4 min ·
[2602.14879] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
Data Science

CT-Bench introduces a benchmark dataset for multimodal lesion understanding in CT scans, featuring 20,335 lesions and a visual question a...

arXiv - AI · 3 min ·
[2602.14834] Debiasing Central Fixation Confounds Reveals a Peripheral "Sweet Spot" for Human-like Scanpaths in Hard-Attention Vision
Machine Learning

This paper explores the impact of central fixation bias on evaluating human-like scanpaths in vision models, proposing a new metric to im...

arXiv - AI · 4 min ·
[2602.14788] VIPA: Visual Informative Part Attention for Referring Image Segmentation
NLP

The paper presents VIPA, a novel framework for Referring Image Segmentation that enhances attention mechanisms by leveraging informative ...

arXiv - AI · 4 min ·
[2602.14989] ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
LLMs

ThermEval introduces a benchmark for evaluating vision-language models on thermal imagery, highlighting their limitations in temperature-...

arXiv - AI · 4 min ·
[2602.14846] Multi-dimensional Persistent Sheaf Laplacians for Image Analysis
NLP

This paper introduces a multi-dimensional persistent sheaf Laplacian (MPSL) framework for image analysis, enhancing dimensionality reduct...

arXiv - Machine Learning · 3 min ·
[2602.14771] GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
Machine Learning

GOT-JEPA introduces a novel framework for generic object tracking that enhances model adaptation and occlusion handling, improving robust...

arXiv - AI · 4 min ·
[2602.14482] TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning
LLMs

The paper presents TikArt, an aperture-guided agent for fine-grained visual reasoning in multimodal large language models, enhancing deci...

arXiv - AI · 3 min ·
[2602.14464] CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer
Machine Learning

The paper presents CoCoDiff, a novel framework for fine-grained style transfer in images, emphasizing semantic correspondence and achievi...

arXiv - AI · 3 min ·
[2602.14615] VariViT: A Vision Transformer for Variable Image Sizes
Machine Learning

The paper introduces VariViT, a Vision Transformer designed to effectively handle variable image sizes, improving feature representation ...

arXiv - AI · 4 min ·
[2602.14408] Feature Recalibration Based Olfactory-Visual Multimodal Model for Fine-Grained Rice Deterioration Detection
Machine Learning

The paper presents a novel olfactory-visual multimodal model for detecting fine-grained rice deterioration, achieving high accuracy and s...

arXiv - AI · 3 min ·
[2602.14498] Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
Machine Learning

This paper presents a novel uncertainty-aware multimodal segmentation framework that integrates radiological images and clinical text to ...

arXiv - Machine Learning · 4 min ·
[2602.14401] pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI
Machine Learning

The paper presents pFedNavi, a personalized federated learning framework for Vision-Language Navigation (VLN) that addresses privacy conc...

arXiv - AI · 3 min ·
[2602.14381] Adapting VACE for Real-Time Autoregressive Video Diffusion
Generative AI

This article presents an adaptation of VACE for real-time autoregressive video generation, enhancing video control while addressing laten...

arXiv - AI · 3 min ·
[2602.14365] Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data
Computer Vision

This paper presents a novel framework for detecting joint inflammation in rheumatoid arthritis using RGB images, addressing challenges li...

arXiv - AI · 4 min ·
[2602.14345] AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports
NLP

The paper presents AXE, an innovative framework for validating zero-day vulnerabilities using minimal metadata, achieving a significant i...

arXiv - AI · 4 min ·
[2602.14193] Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation
Robotics

The paper presents a novel Part-Aware 3D Feature Field (PA3FF) for enhancing robotic manipulation of articulated objects, addressing chal...

arXiv - Machine Learning · 4 min ·
[2602.14236] Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models
LLMs

The paper presents Sali-Cache, a novel optimization framework for Vision-Language Models (VLMs) that addresses memory bottlenecks in long...

arXiv - AI · 3 min ·
[2602.14237] AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks
Machine Learning

The paper presents AbracADDbra, a framework that enhances object addition in computer vision by decoupling placement and editing tasks th...

arXiv - AI · 3 min ·
[2602.14201] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
LLMs

GeoEyes introduces a novel framework for enhancing visual understanding in ultra-high-resolution remote sensing imagery, addressing limit...

arXiv - AI · 3 min ·