Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2511.21428] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Machine Learning

arXiv - AI · 4 min ·
[2511.16719] SAM 3: Segment Anything with Concepts
Machine Learning

arXiv - AI · 4 min ·
[2603.28594] Detection of Adversarial Attacks in Robotic Perception
Machine Learning

arXiv - AI · 3 min ·

All Content

[2602.13322] Diagnostic Benchmarks for Invariant Learning Dynamics: Empirical Validation of the Eidos Architecture
Data Science

This paper presents the PolyShapes-Ideal (PSI) dataset and diagnostic benchmarks for evaluating topological invariance in machine learnin...

arXiv - Machine Learning · 3 min ·
[2602.13758] OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
LLMs

The paper introduces OmniScience, a large-scale multi-modal dataset designed to enhance scientific image understanding in AI models, addr...

arXiv - AI · 4 min ·
[2602.13297] Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset
Machine Learning

This paper explores the use of conditional generative models to synthesize high-resolution range profiles (HRRPs) for maritime surveillan...

arXiv - Machine Learning · 3 min ·
[2602.13681] An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment
Computer Vision

This article presents an Ensemble Learning approach to enhance waste segmentation accuracy in cluttered environments, crucial for improvi...

arXiv - AI · 4 min ·
[2602.13296] MFN Decomposition and Related Metrics for High-Resolution Range Profiles Generative Models
Machine Learning

This paper presents a novel approach to evaluating high-resolution range profile (HRRP) data using MFN decomposition, addressing challeng...

arXiv - Machine Learning · 3 min ·
[2602.13662] LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases
LLMs

LeafNet introduces a large-scale dataset and benchmark for evaluating vision-language models in plant disease diagnosis, highlighting sig...

arXiv - AI · 4 min ·
[2602.13650] KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
LLMs

The article presents KorMedMCQA-V, a benchmark dataset for evaluating vision-language models on the Korean Medical Licensing Examination,...

arXiv - AI · 4 min ·
[2602.13588] Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks
Computer Vision

The paper presents TwInS, a novel framework for joint learning of scene parsing and geometric vision tasks, inspired by the human visual ...

arXiv - AI · 4 min ·
[2602.13555] Privacy-Concealing Cooperative Perception for BEV Scene Segmentation
Computer Vision

The paper presents a Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation, enhancing autonomous...

arXiv - AI · 4 min ·
[2602.14889] Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment
Machine Learning

The paper presents a framework for web-scale multimodal summarization that integrates text and image data using CLIP-based semantic align...

arXiv - Machine Learning · 3 min ·
[2602.13469] How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People
LLMs

This article explores how multimodal large language models (MLLMs) enhance access to visual information for blind and low vision individu...

arXiv - AI · 4 min ·
[2602.13444] FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation
Machine Learning

FlowHOI presents a novel framework for generating hand-object interactions in robotic manipulation, enhancing the realism and efficiency ...

arXiv - AI · 4 min ·
[2602.13376] An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation
LLMs

This article presents a novel reference-free evaluation framework for assessing the quality of flowchart image-to-code generation, utiliz...

arXiv - AI · 3 min ·
[2602.13357] AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers
Machine Learning

The paper introduces AdaCorrection, a framework that enhances the efficiency of Diffusion Transformers by correcting cache misalignment, ...

arXiv - AI · 3 min ·
[2602.13352] Using Deep Learning to Generate Semantically Correct Hindi Captions
Machine Learning

This article explores the use of deep learning techniques to generate semantically accurate image captions in Hindi, utilizing advanced m...

arXiv - AI · 4 min ·
[2602.13350] Detecting Brick Kiln Infrastructure at Scale: Graph, Foundation, and Remote Sensing Models for Satellite Imagery Data
Machine Learning

This paper presents a novel approach to detecting brick kiln infrastructure using high-resolution satellite imagery, focusing on a new mo...

arXiv - AI · 4 min ·
[2602.13349] From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models
Machine Learning

This paper discusses a new automated pipeline for generating brand-safe marketing imagery using text-to-image models, balancing automatio...

arXiv - AI · 3 min ·
[2602.13347] Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots
Machine Learning

The paper presents FOREST, a diffusion-based world model for robotic stow operations, enhancing the prediction of post-stow configuration...

arXiv - AI · 3 min ·
[2602.13339] An Integrated Causal Inference Framework for Traffic Safety Modeling with Semantic Street-View Visual Features
Machine Learning

This article presents a novel causal inference framework for traffic safety modeling, utilizing semantic features from street-view images...

arXiv - AI · 4 min ·
[2602.13332] MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling
LLMs

The paper presents MedScope, a clinical video reasoning model that enhances decision-making in medical contexts by integrating tool use a...

arXiv - AI · 4 min ·