Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2602.09678] Administrative Law's Fourth Settlement: AI and the Capability-Accountability Trap
Computer Vision

arXiv - AI · 4 min ·
[2601.13622] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
LLMs

arXiv - AI · 3 min ·
[2603.26551] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones
Computer Vision

arXiv - AI · 4 min ·

All Content

[2503.13444] VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
LLMs

VideoMind introduces a novel approach for temporal-grounded video reasoning using a Chain-of-LoRA agent, enhancing multi-modal reasoning ...

arXiv - AI · 4 min ·
[2407.17412] (PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork
Machine Learning

The paper presents PASS, a novel algorithmic framework that utilizes visual prompts to enhance structural sparsity in neural networks, im...

arXiv - AI · 4 min ·
[2405.14504] Adaptive Runge-Kutta Dynamics for Spatiotemporal Prediction
Machine Learning

The paper presents an innovative approach using an adaptive Runge-Kutta method for spatiotemporal prediction, enhancing model accuracy in...

arXiv - AI · 4 min ·
[2211.12817] Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI
Machine Learning

This article explores self-supervised context reasoning in humans and AI, presenting a model called SeCo that learns contextual relations...

arXiv - AI · 4 min ·
[2510.13205] CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection
Machine Learning

CleverCatch introduces a knowledge-guided weak supervision model for detecting healthcare fraud, enhancing accuracy and interpretability ...

arXiv - AI · 4 min ·
[2601.21468] MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
Computer Vision

MemOCR introduces a multimodal memory agent that enhances long-horizon reasoning by using layout-aware visual memory, optimizing context ...

arXiv - AI · 3 min ·
[2601.05500] The Illusion of Human AI Parity Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm
LLMs

This paper discusses the impact of uncertainty in ground truth evaluations on AI performance assessments, proposing a probabilistic frame...

arXiv - AI · 4 min ·
[2510.27623] BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
LLMs

The paper presents BEAT, a novel framework for executing visual backdoor attacks on Vision-Language Model (VLM)-based embodied agents, hi...

arXiv - AI · 4 min ·
[2510.00523] VIRTUE: Visual-Interactive Text-Image Universal Embedder
LLMs

The paper presents VIRTUE, a novel Visual-Interactive Text-Image Universal Embedder that enhances multimodal representation learning by i...

arXiv - AI · 4 min ·
[2509.03830] Decoding Tourist Perception in Historic Urban Quarters with Multimodal Social Media Data: An AI-Based Framework and Evidence from Shanghai
Data Science

This study presents an AI-based framework to analyze tourist perceptions in historic urban quarters of Shanghai, utilizing multimodal soc...

arXiv - AI · 4 min ·
[2505.03646] GRILL: Restoring Gradient Signal in Ill-Conditioned Layers for More Effective Adversarial Attacks on Autoencoders
Machine Learning

The paper presents GRILL, a method to enhance adversarial attacks on autoencoders by restoring gradient signals in ill-conditioned layers...

arXiv - AI · 4 min ·
[2602.20119] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
LLMs

NovaPlan introduces a framework for zero-shot long-horizon manipulation in robotics, integrating video language planning with geometrical...

arXiv - AI · 4 min ·
[2602.20114] Benchmarking Unlearning for Vision Transformers
Machine Learning

This article presents a benchmarking study on unlearning algorithms for Vision Transformers (VTs), highlighting their performance compare...

arXiv - AI · 4 min ·
[2602.20089] StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
LLMs

The paper presents StructXLIP, a novel approach that enhances vision-language models by integrating multimodal structural cues, improving...

arXiv - AI · 4 min ·
[2602.20066] HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images
LLMs

The paper presents HeatPrompt, a zero-shot vision-language framework for estimating urban heat demand from satellite images, enhancing en...

arXiv - AI · 3 min ·
[2602.20065] Multilingual Large Language Models do not comprehend all natural languages to equal degrees
LLMs

This article examines the performance of multilingual large language models (LLMs) across various languages, revealing that comprehension...

arXiv - AI · 4 min ·
[2602.20055] To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation
Robotics

This paper presents a novel constraint-based planning framework for mobile robots, enabling zero-shot generalization in interactive navig...

arXiv - AI · 4 min ·
[2602.20159] A Very Big Video Reasoning Suite
Machine Learning

The paper introduces the Very Big Video Reasoning (VBVR) Dataset, a large-scale resource for studying video reasoning capabilities, featu...

arXiv - Machine Learning · 4 min ·
[2602.20051] SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency
Computer Vision

The SEAL-pose framework enhances 3D human pose estimation by utilizing a learned loss function that captures structural consistency among...

arXiv - AI · 4 min ·
[2602.20028] Descriptor: Dataset of Parasitoid Wasps and Associated Hymenoptera (DAPWH)
Data Science

The article presents a curated dataset of parasitoid wasps and associated Hymenoptera, aimed at enhancing automated identification system...

arXiv - AI · 4 min ·