Computer Vision

Image recognition, detection, and visual AI

Top This Week

Computer Vision

[2602.09678] Administrative Law's Fourth Settlement: AI and the Capability-Accountability Trap

arXiv - AI · 4 min · about 2 hours ago

LLMs

[2601.13622] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

arXiv - AI · 3 min · about 2 hours ago

Computer Vision

[2603.26551] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

arXiv - AI · 4 min · about 2 hours ago

All Content

LLMs

[2503.13444] VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

VideoMind introduces a novel approach for temporal-grounded video reasoning using a Chain-of-LoRA agent, enhancing multi-modal reasoning ...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2407.17412] (PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork

The paper presents PASS, a novel algorithmic framework that utilizes visual prompts to enhance structural sparsity in neural networks, im...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2405.14504] Adaptive Runge-Kutta Dynamics for Spatiotemporal Prediction

The paper presents an innovative approach using an adaptive Runge-Kutta method for spatiotemporal prediction, enhancing model accuracy in...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2211.12817] Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI

This article explores self-supervised context reasoning in humans and AI, presenting a model called SeCo that learns contextual relations...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2510.13205] CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection

CleverCatch introduces a knowledge-guided weak supervision model for detecting healthcare fraud, enhancing accuracy and interpretability ...

arXiv - AI · 4 min · about 1 month ago

Computer Vision

[2601.21468] MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

MemOCR introduces a multimodal memory agent that enhances long-horizon reasoning by using layout-aware visual memory, optimizing context ...

arXiv - AI · 3 min · about 1 month ago

LLMs

[2601.05500] The Illusion of Human AI Parity Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm

This paper discusses the impact of uncertainty in ground truth evaluations on AI performance assessments, proposing a probabilistic frame...

arXiv - AI · 4 min · about 1 month ago

LLMs

[2510.27623] BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

The paper presents BEAT, a novel framework for executing visual backdoor attacks on Vision-Language Model (VLM)-based embodied agents, hi...

arXiv - AI · 4 min · about 1 month ago

LLMs

[2510.00523] VIRTUE: Visual-Interactive Text-Image Universal Embedder

The paper presents VIRTUE, a novel Visual-Interactive Text-Image Universal Embedder that enhances multimodal representation learning by i...

arXiv - AI · 4 min · about 1 month ago

Data Science

[2509.03830] Decoding Tourist Perception in Historic Urban Quarters with Multimodal Social Media Data: An AI-Based Framework and Evidence from Shanghai

This study presents an AI-based framework to analyze tourist perceptions in historic urban quarters of Shanghai, utilizing multimodal soc...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2505.03646] GRILL: Restoring Gradient Signal in Ill-Conditioned Layers for More Effective Adversarial Attacks on Autoencoders

The paper presents GRILL, a method to enhance adversarial attacks on autoencoders by restoring gradient signals in ill-conditioned layers...

arXiv - AI · 4 min · about 1 month ago

LLMs

[2602.20119] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

NovaPlan introduces a framework for zero-shot long-horizon manipulation in robotics, integrating video language planning with geometrical...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2602.20114] Benchmarking Unlearning for Vision Transformers

This article presents a benchmarking study on unlearning algorithms for Vision Transformers (VTs), highlighting their performance compare...

arXiv - AI · 4 min · about 1 month ago

LLMs

[2602.20089] StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

The paper presents StructXLIP, a novel approach that enhances vision-language models by integrating multimodal structural cues, improving...

arXiv - AI · 4 min · about 1 month ago

LLMs

[2602.20066] HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images

The paper presents HeatPrompt, a zero-shot vision-language framework for estimating urban heat demand from satellite images, enhancing en...

arXiv - AI · 3 min · about 1 month ago

LLMs

[2602.20065] Multilingual Large Language Models do not comprehend all natural languages to equal degrees

This article examines the performance of multilingual large language models (LLMs) across various languages, revealing that comprehension...

arXiv - AI · 4 min · about 1 month ago

Robotics

[2602.20055] To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation

This paper presents a novel constraint-based planning framework for mobile robots, enabling zero-shot generalization in interactive navig...

arXiv - AI · 4 min · about 1 month ago

Machine Learning

[2602.20159] A Very Big Video Reasoning Suite

The paper introduces the Very Big Video Reasoning (VBVR) Dataset, a large-scale resource for studying video reasoning capabilities, featu...

arXiv - Machine Learning · 4 min · about 1 month ago

Computer Vision

[2602.20051] SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency

The SEAL-pose framework enhances 3D human pose estimation by utilizing a learned loss function that captures structural consistency among...

arXiv - AI · 4 min · about 1 month ago

Data Science

[2602.20028] Descriptor: Dataset of Parasitoid Wasps and Associated Hymenoptera (DAPWH)

The article presents a curated dataset of parasitoid wasps and associated Hymenoptera, aimed at enhancing automated identification system...

arXiv - AI · 4 min · about 1 month ago
