Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2511.21428] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Machine Learning

arXiv - AI · 4 min

[2511.16719] SAM 3: Segment Anything with Concepts
Machine Learning

arXiv - AI · 4 min

[2603.28594] Detection of Adversarial Attacks in Robotic Perception
Machine Learning

arXiv - AI · 3 min

All Content

[2508.01423] 3DRot: Rediscovering the Missing Primitive for RGB-Based 3D Augmentation
Computer Vision

The paper introduces 3DRot, a novel RGB-based 3D augmentation technique that enhances geometric consistency in 3D tasks by enabling effec...

arXiv - Machine Learning · 4 min

[2507.07139] Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning
Machine Learning

The paper presents Recall, a novel adversarial framework that targets the robustness of image generation model unlearning, revealing vuln...

arXiv - Machine Learning · 4 min

[2505.18487] Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning
Robotics

This paper explores how grounding bodily awareness in visual representations can enhance policy learning for robotic manipulation, introd...

arXiv - Machine Learning · 3 min

[2509.13229] Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
Machine Learning

This article presents a novel framework, Curriculum Multi-Task Self-Supervision Learning (CMTSSL), aimed at enhancing lightweight archite...

arXiv - Machine Learning · 4 min

[2412.14294] TRecViT: A Recurrent Video Transformer
Machine Learning

TRecViT introduces a novel recurrent video transformer architecture that excels in causal video modeling, outperforming existing models w...

arXiv - Machine Learning · 4 min

[2508.19300] CellINR: Implicitly Overcoming Photo-induced Artifacts in 4D Live Fluorescence Microscopy
Computer Vision

The paper presents CellINR, a novel framework designed to mitigate photo-induced artifacts in 4D live fluorescence microscopy, enhancing ...

arXiv - AI · 4 min

[2508.07514] Robust MultiSpecies Agricultural Segmentation Across Devices, Seasons, and Sensors Using Hierarchical DINOv2 Models
Machine Learning

This article presents a robust segmentation framework using Hierarchical DINOv2 models for reliable plant species and damage identificati...

arXiv - AI · 4 min

[2403.15605] Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization
Machine Learning

The paper presents a novel method, gPerXAN, for Federated Domain Generalization (FedDG) that enhances model performance by effectively as...

arXiv - Machine Learning · 4 min

[2506.11526] Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
LLMs

This survey explores the role of foundation models in enhancing scenario generation and analysis for autonomous driving, addressing limit...

arXiv - AI · 4 min

[2506.03407] Multi-Spectral Gaussian Splatting with Neural Color Representation
Machine Learning

The paper presents MS-Splatting, a novel multi-spectral 3D Gaussian Splatting framework that generates consistent views from images captu...

arXiv - Machine Learning · 4 min

[2505.12641] Single Image Reflection Separation via Dual Prior Interaction Transformer
Machine Learning

This paper presents a novel approach to single image reflection separation using a Dual Prior Interaction Transformer, enhancing the extr...

arXiv - AI · 4 min

[2503.04641] Simulating the Real World: A Unified Survey of Multimodal Generative Models
Machine Learning

This article presents a comprehensive survey of multimodal generative models, focusing on their integration from 2D to 4D representations...

arXiv - Machine Learning · 4 min

[2412.00686] LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models
LLMs

The paper presents LVLM-COUNT, a method to enhance the counting ability of large vision-language models (LVLMs) by using a divide-and-con...

arXiv - AI · 4 min

[2511.11681] MPCM-Net: Multi-scale network integrates partial attention convolution with Mamba for ground-based cloud image segmentation
Machine Learning

The paper presents MPCM-Net, a novel multi-scale network that enhances ground-based cloud image segmentation through partial attention co...

arXiv - Machine Learning · 4 min

[2510.03574] Efficient Test-Time Scaling for Small Vision-Language Models
LLMs

The paper presents efficient test-time scaling strategies for small vision-language models (VLMs) to enhance their performance without co...

arXiv - Machine Learning · 3 min

[2303.09807] TKN: Transformer-based Keypoint Prediction Network For Real-time Video Prediction
Machine Learning

The paper presents TKN, a transformer-based neural network designed for real-time video prediction, achieving a remarkable prediction rat...

arXiv - AI · 4 min

[2602.07849] LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge
LLMs

The paper presents LQA, a lightweight quantized-adaptive framework designed to enhance the deployment of Vision-Language Models (VLMs) on...

arXiv - AI · 3 min

[2602.05847] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
Machine Learning

The paper introduces OmniVideo-R1, a novel framework designed to enhance audio-visual reasoning through query intention and modality atte...

arXiv - AI · 3 min

[2507.22554] DeepC4: Deep Conditional Census-Constrained Clustering for Large-scale Multitask Spatial Disaggregation of Urban Morphology
Machine Learning

The paper presents DeepC4, a novel deep learning approach for spatial disaggregation of urban morphology, enhancing mapping quality using...

arXiv - Machine Learning · 4 min

[2510.10689] OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
LLMs

The paper introduces OmniVideoBench, a benchmark designed to evaluate audio-visual understanding in multimodal large language models (MLL...

arXiv - AI · 4 min