Computer Vision

Image recognition, detection, and visual AI

Top This Week

[2511.09675] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild
Machine Learning

arXiv - Machine Learning · 4 min

[2509.15219] Out-of-Sight Embodied Agents: Multimodal Tracking, Sensor Fusion, and Trajectory Forecasting
Machine Learning

arXiv - Machine Learning · 4 min

[2603.26657] Tunable Soft Equivariance with Guarantees
Machine Learning

arXiv - Machine Learning · 3 min

All Content

[2602.20878] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
LLMs

This article introduces Vision-Language Causal Graphs (VLCGs) to enhance causal reasoning in Large Vision-Language Models (LVLMs), addressing t...

arXiv - AI · 3 min

[2602.20739] PyVision-RL: Forging Open Agentic Vision Models via RL
Machine Learning

The paper introduces PyVision-RL, a reinforcement learning framework designed to enhance agentic multimodal models by preventing interact...

arXiv - AI · 3 min

[2602.20659] Recursive Belief Vision Language Model
LLMs

The Recursive Belief Vision Language Model (RB-VLA) addresses limitations in current vision-language-action models by introducing a belie...

arXiv - AI · 4 min

A retinal reboot for amblyopia | MIT Technology Review
Computer Vision

A new study reveals that anesthetizing the retina of a 'lazy' eye for two days can restore vision in mice, offering hope for treating amb...

MIT Technology Review - AI · 3 min

How the rail sector is adapting to an AI-enabled future
Machine Learning

The rail sector is embracing AI to enhance data processing and operational efficiency, with initiatives like Great British Railways lever...

AI News - General · 14 min

Anthropic Slams China for AI Theft, But Critics Say the Outrage Is Hypocritical
NLP

Anthropic accuses Chinese developers of stealing AI secrets from its Claude chatbot, sparking criticism over its own data scraping practi...

AI Tools & Products · 7 min

[2602.08550] GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing
Machine Learning

GOT-Edit introduces a novel approach to generic object tracking by integrating geometry-aware cues through online model editing, enhancin...

arXiv - Machine Learning · 4 min

[2601.16210] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Generative AI

The paper introduces PyraTok, a language-aligned pyramidal tokenizer designed to enhance video understanding and generation by improving ...

arXiv - AI · 3 min

[2512.02700] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
LLMs

The paper presents VLM-Pruner, a novel token pruning algorithm designed to enhance the efficiency of vision-language models (VLMs) by bal...

arXiv - Machine Learning · 4 min

[2512.13742] DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models
LLMs

The DL$^3$M framework integrates deep learning and large language models to enhance medical reasoning from images, addressing limitations...

arXiv - AI · 4 min

[2511.07399] StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
Machine Learning

StreamDiffusionV2 presents a novel system for dynamic and interactive video generation, enhancing live streaming capabilities through opt...

arXiv - Machine Learning · 4 min

[2511.06450] Countering Multi-modal Representation Collapse through Rank-targeted Fusion
Machine Learning

This paper presents a novel framework, Rank-enhancing Token Fuser, to address multi-modal representation collapse in machine learning, en...

arXiv - Machine Learning · 4 min

[2511.16175] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Machine Learning

The paper introduces Mantis, a Vision-Language-Action model that enhances visual foresight through a novel framework, achieving superior ...

arXiv - AI · 4 min

[2511.02860] AI-driven Large-scale Electron Microscopy enables Whole-tissue Subcellular Digitization
Machine Learning

The article presents DeepOrganelle, a deep learning tool that enhances large-scale electron microscopy for mapping organelle distribution...

arXiv - AI · 3 min

[2510.06820] Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Machine Learning

The paper presents EDJE, an Efficient Discriminative Joint Encoder designed to enhance vision-language reranking by precomputing visual t...

arXiv - Machine Learning · 3 min

[2509.26287] Flower: A Flow-Matching Solver for Inverse Problems
Machine Learning

The paper introduces Flower, a novel solver for linear inverse problems that utilizes a pre-trained flow model to enhance reconstruction ...

arXiv - Machine Learning · 3 min

[2510.14979] From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
LLMs

The paper discusses the development of native Vision-Language Models (VLMs) that integrate vision and language capabilities more effectiv...

arXiv - AI · 4 min

[2510.02240] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
LLMs

The paper presents RewardMap, a multi-stage reinforcement learning framework aimed at improving fine-grained visual reasoning in multimod...

arXiv - AI · 4 min

[2505.17779] U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
LLMs

The paper introduces U2-BENCH, a benchmark for evaluating large vision-language models (LVLMs) on ultrasound understanding, addressing ch...

arXiv - Machine Learning · 4 min

[2509.24526] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models
Machine Learning

The paper introduces Consistency Mid-Training (CMT), a novel method for enhancing the efficiency of training flow map models, achieving s...

arXiv - Machine Learning · 4 min