Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2506.22504] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection
Machine Learning

arXiv - Machine Learning · 4 min

[2508.00307] Acoustic Imaging for Low-SNR UAV Detection: Dense Beamformed Energy Maps and U-Net SELD
Machine Learning

arXiv - AI · 4 min

[2603.25524] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
Computer Vision

arXiv - AI · 4 min

All Content

[2602.21203] Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics
Machine Learning

The paper presents Squint, a novel visual reinforcement learning method that enhances training efficiency for sim-to-real robotics, achie...

arXiv - Machine Learning · 3 min

[2602.21142] LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
LLMs

The LUMEN model enhances radiological diagnosis by leveraging longitudinal imaging data and multi-modal training, improving prognostic ca...

arXiv - Machine Learning · 4 min

[2602.20901] SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
LLMs

The paper introduces SpatiaLQA, a benchmark for evaluating spatial logical reasoning in Vision-Language Models (VLMs), highlighting their...

arXiv - Machine Learning · 4 min

[2602.20616] Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model
Machine Learning

This article presents a novel approach to open-world object detection through an interpretable framework that enhances the identification...

arXiv - Machine Learning · 4 min

[2602.20555] Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,λ}$ Targets
LLMs

This paper demonstrates that standard Transformers can achieve the minimax optimal rate in nonparametric regression for Hölder functions,...

arXiv - Machine Learning · 3 min

[2602.20465] Prior-Agnostic Incentive-Compatible Exploration
AI Safety

The paper discusses a novel approach to incentive-compatible exploration in bandit settings, addressing the misalignment between principa...

arXiv - Machine Learning · 3 min

[2602.20165] VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography
Machine Learning

The paper presents VISION-ICE, an AI framework utilizing intracardiac echocardiography to identify arrhythmia origins, achieving 66.2% ac...

arXiv - Machine Learning · 4 min

[2602.21081] Scaling Vision Transformers: Evaluating DeepSpeed for Image-Centric Workloads
LLMs

This article evaluates the use of DeepSpeed to enhance the scalability of Vision Transformers (ViTs) for image-centric workloads, focusin...

arXiv - Machine Learning · 3 min

[2602.20549] Sample-efficient evidence estimation of score based priors for model selection
Machine Learning

The paper presents a novel estimator for model evidence in Bayesian inverse problems, particularly using diffusion models, enhancing samp...

arXiv - Machine Learning · 4 min

[2602.20360] Momentum Guidance: Plug-and-Play Guidance for Flow Models
Machine Learning

The paper introduces Momentum Guidance (MG), a novel technique for enhancing flow-based generative models, achieving significant improvem...

arXiv - Machine Learning · 3 min

[2602.20309] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Machine Learning

QuantVLA introduces a novel post-training quantization framework for Vision-Language-Action models, enhancing efficiency without addition...

arXiv - Machine Learning · 4 min

[2602.09050] SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy
AI Safety

The paper introduces SAS-Net, a novel framework for robust spatiotemporal registration in bidirectional photoacoustic microscopy, address...

arXiv - AI · 4 min

[2602.09082] UI-Venus-1.5 Technical Report
Machine Learning

The UI-Venus-1.5 Technical Report presents advancements in GUI agents, detailing a unified model that enhances task performance across va...

arXiv - Machine Learning · 4 min

[2602.02620] CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
Machine Learning

CryoLVM introduces a self-supervised learning model for cryo-electron microscopy (cryo-EM) density maps, enhancing structural representat...

arXiv - Machine Learning · 3 min

[2601.11675] Generating metamers of human scene understanding
Machine Learning

This article presents MetamerGen, a novel tool that generates metamers of human scene understanding by combining low-resolution gist info...

arXiv - AI · 4 min

[2601.10611] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
LLMs

Molmo2 introduces a new family of open-weight vision-language models that excel in video understanding and grounding, featuring innovativ...

arXiv - AI · 4 min

[2601.09708] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Machine Learning

The paper presents Fast-ThinkAct, a novel framework for efficient Vision-Language-Action reasoning that reduces inference latency while m...

arXiv - Machine Learning · 3 min

[2601.01874] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
LLMs

CogFlow introduces a novel framework for visual mathematical problem solving, enhancing perception and reasoning through knowledge intern...

arXiv - AI · 4 min

[2511.17844] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Machine Learning

This article presents a novel data-efficient approach for fine-tuning text-to-video generation models, demonstrating that low-quality syn...

arXiv - AI · 3 min

[2511.02565] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Machine Learning

The paper presents VCFlow, a novel architecture for subject-agnostic brain visual decoding, enhancing the reconstruction of visual experi...

arXiv - AI · 4 min