Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2511.21428] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Machine Learning

arXiv - AI · 4 min ·
[2511.16719] SAM 3: Segment Anything with Concepts
Machine Learning

arXiv - AI · 4 min ·
[2603.28594] Detection of Adversarial Attacks in Robotic Perception
Machine Learning

arXiv - AI · 3 min ·

All Content

[2602.14178] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
LLMs

The paper presents UniWeTok, a unified binary tokenizer with a massive codebook size of 2^128, designed to enhance multimodal large langu...

arXiv - AI · 4 min ·
[2602.14177] Towards Spatial Transcriptomics-driven Pathology Foundation Models
LLMs

This article presents Spatial Expression-Aligned Learning (SEAL), a framework that integrates spatial transcriptomics with pathology mode...

arXiv - AI · 4 min ·
[2602.14157] When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance
Machine Learning

The paper explores a novel approach to image and video editing using test-time guidance with diffusion models, achieving performance comp...

arXiv - AI · 3 min ·
[2602.14140] Detection of On-Ground Chestnuts Using Artificial Intelligence Toward Automated Picking
Computer Vision

This article presents a study on using AI for detecting chestnuts on the ground to improve automated harvesting, highlighting the effecti...

arXiv - AI · 4 min ·
[2602.14134] DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors
LLMs

The paper introduces DenseMLLM, a multimodal large language model designed to perform dense predictions without the need for complex, tas...

arXiv - AI · 3 min ·
[2602.13930] MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction
Machine Learning

MamaDino is a novel hybrid vision model that enhances breast cancer risk prediction by utilizing lower-resolution mammograms while mainta...

arXiv - Machine Learning · 4 min ·
[2602.14099] SemanticFeels: Semantic Labeling during In-Hand Manipulation
Robotics

The paper presents SemanticFeels, a novel framework for semantic labeling during in-hand manipulation, enhancing robots' ability to class...

arXiv - AI · 3 min ·
[2602.13889] Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification
Machine Learning

The paper presents a novel approach to font classification using DINOv2, achieving high accuracy with minimal parameter tuning and introd...

arXiv - Machine Learning · 3 min ·
[2602.14073] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
LLMs

This article presents a methodology for adapting vision-language models to the Polish language using the LLaVA framework, demonstrating s...

arXiv - AI · 4 min ·
[2602.14042] Restoration Adaptation for Semantic Segmentation on Low Quality Images
Machine Learning

This paper presents a novel approach, Restoration Adaptation for Semantic Segmentation (RASS), which enhances semantic segmentation perfo...

arXiv - AI · 4 min ·
[2602.14041] BitDance: Scaling Autoregressive Generative Models with Binary Tokens
Machine Learning

BitDance introduces a novel autoregressive image generator that utilizes binary tokens for enhanced efficiency and performance in generat...

arXiv - AI · 4 min ·
[2602.13818] VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer
Machine Learning

The VAR-3D model introduces a novel approach to text-to-3D generation, addressing challenges in discrete 3D representation and enhancing ...

arXiv - Machine Learning · 3 min ·
[2602.14010] A Deployment-Friendly Foundational Framework for Efficient Computational Pathology
LLMs

This paper presents LitePath, a foundational framework for computational pathology that significantly reduces computational costs while m...

arXiv - AI · 4 min ·
[2602.13712] Fine-tuned Vision Language Model for Localization of Parasitic Eggs in Microscopic Images
LLMs

This paper presents a fine-tuned Vision Language Model (VLM) designed for the localization of parasitic eggs in microscopic images, demon...

arXiv - Machine Learning · 3 min ·
[2602.13602] Towards Sparse Video Understanding and Reasoning
LLMs

The paper introduces REVISE, a multi-round agent designed for video question answering (VQA) that enhances efficiency by selecting inform...

arXiv - Machine Learning · 3 min ·
[2602.13901] RPGD: RANSAC-P3P Gradient Descent for Extrinsic Calibration in 3D Human Pose Estimation
Computer Vision

The paper presents RPGD, a novel framework for extrinsic calibration in 3D human pose estimation, combining RANSAC-P3P and gradient desce...

arXiv - Machine Learning · 3 min ·
[2602.13842] Automated Prediction of Paravalvular Regurgitation before Transcatheter Aortic Valve Implantation
Machine Learning

This paper explores the use of deep learning to predict paravalvular regurgitation (PVR) in patients undergoing Transcatheter Aortic Valv...

arXiv - AI · 4 min ·
[2602.13378] LAF-YOLOv10 with Partial Convolution Backbone, Attention-Guided Feature Pyramid, Auxiliary P2 Head, and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery
Computer Vision

The paper presents LAF-YOLOv10, an advanced model for small object detection in drone imagery, integrating techniques like Partial Convol...

arXiv - Machine Learning · 4 min ·
[2602.13414] FUTON: Fourier Tensor Network for Implicit Neural Representations
Machine Learning

The paper introduces FUTON, a Fourier Tensor Network designed to enhance implicit neural representations (INRs) by improving convergence ...

arXiv - Machine Learning · 3 min ·
[2602.13334] Ask the Expert: Collaborative Inference for Vision Transformers with Near-Edge Accelerators
Machine Learning

This article presents a collaborative inference framework for deploying Vision Transformers on edge devices, addressing computational cha...

arXiv - Machine Learning · 3 min ·