Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2602.09678] Administrative Law's Fourth Settlement: AI and the Capability-Accountability Trap
Computer Vision

arXiv - AI · 4 min ·
[2601.13622] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
LLMs

arXiv - AI · 3 min ·
[2603.26551] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones
Computer Vision

arXiv - AI · 4 min ·

All Content

[2602.16305] BAT: Better Audio Transformer Guided by Convex Gated Probing
Machine Learning

The paper introduces the Better Audio Transformer (BAT), which utilizes a novel Convex Gated Probing method to enhance audio self-supervi...

arXiv - Machine Learning · 3 min ·
[2602.16430] Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems
LLMs

This article discusses the development of production-scale Optical Character Recognition (OCR) systems tailored for India's multilingual ...

arXiv - AI · 3 min ·
[2602.16422] Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model
LLMs

This article presents a novel framework for generating histopathology reports using a combination of a foundation model and a Transformer...

arXiv - AI · 4 min ·
[2602.16356] Articulated 3D Scene Graphs for Open-World Mobile Manipulation
Robotics

This paper presents MoMa-SG, a framework for creating semantic-kinematic 3D scene graphs to enhance mobile manipulation of articulated ob...

arXiv - AI · 4 min ·
[2602.16334] Spatial Audio Question Answering and Reasoning on Dynamic Source Movements
Machine Learning

This article presents a study on Spatial Audio Question Answering (Spatial AQA) focusing on dynamic sound source movements, introducing i...

arXiv - AI · 4 min ·
[2602.16322] A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks
Machine Learning

This paper presents a self-supervised learning approach to enhance feature representations in object detection tasks, reducing the need f...

arXiv - AI · 3 min ·
[2602.16132] CHAI: CacHe Attention Inference for text2video
Machine Learning

The paper presents CHAI, a novel approach to enhance text-to-video generation by utilizing Cache Attention for efficient inference, achie...

arXiv - Machine Learning · 3 min ·
[2602.16086] LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization
NLP

The paper presents LGQ, a novel image tokenizer that learns discretization geometry to enhance scalability and stability in visual genera...

arXiv - Machine Learning · 4 min ·
[2602.15950] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
LLMs

This article investigates the limitations of vision-language models (VLMs) in spatial reasoning, particularly their struggle to localize ...

arXiv - Machine Learning · 4 min ·
[2602.16110] OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
Computer Vision

The paper presents OmniCT, a unified slice-volume large vision-language model (LVLM) designed for comprehensive CT analysis, addressing l...

arXiv - AI · 4 min ·
[2602.15926] A Study on Real-time Object Detection using Deep Learning
Machine Learning

This article explores real-time object detection using deep learning, detailing various algorithms, applications, and future research dir...

arXiv - Machine Learning · 4 min ·
[2602.16073] ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios
Machine Learning

The paper presents ScenicRules, a benchmark for evaluating autonomous driving systems that balances multiple objectives like safety and e...

arXiv - AI · 4 min ·
[2602.16019] MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
LLMs

The paper presents MedProbCLIP, a probabilistic framework for enhancing the reliability of radiograph-report retrieval using vision-langu...

arXiv - AI · 4 min ·
[2602.15872] MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
LLMs

The paper presents MARVL, a novel approach for robotic manipulation that utilizes Vision-Language Models (VLMs) to enhance task performan...

arXiv - Machine Learning · 3 min ·
[2602.15959] Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration
AI Safety

This paper presents GPEReg-Net, a novel framework for improving image registration in bidirectional photoacoustic microscopy by disentang...

arXiv - AI · 3 min ·
[2602.15958] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Data Science

The paper introduces DocSplit, a benchmark dataset and evaluation framework for document packet recognition and splitting, addressing cha...

arXiv - AI · 4 min ·
[2602.15918] EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
LLMs

The paper presents EarthSpatialBench, a benchmark designed to evaluate spatial reasoning capabilities of multimodal large language models...

arXiv - AI · 4 min ·
[2602.15915] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Machine Learning

The paper presents MaS-VQA, a novel framework for Knowledge-Based Visual Question Answering that enhances answer accuracy by integrating ...

arXiv - AI · 3 min ·
[2602.15913] Foundation Models for Medical Imaging: Status, Challenges, and Directions
LLMs

This article reviews the current landscape of foundation models (FMs) in medical imaging, discussing their design principles, application...

arXiv - AI · 3 min ·
[2602.15892] Egocentric Bias in Vision-Language Models
LLMs

The paper introduces FlipSet, a benchmark for assessing visual perspective taking in vision-language models, revealing significant egocen...

arXiv - AI · 3 min ·