[2603.25091] Pixelis: Reasoning in Pixels, from Seeing to Acting

arXiv - AI 4 min read

arXiv:2603.25091 (cs), Computer Vision and Pattern Recognition
Submitted on 26 Mar 2026
Title: Pixelis: Reasoning in Pixels, from Seeing to Acting
Authors: Yunpeng Zhou

Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars...
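
To make the phase (1) idea concrete, here is a minimal sketch of an imitation loss that upweights operation/argument tokens. It is an illustration only, not the paper's implementation: the abstract does not give the loss formula, so the weighted-average form, the function name `weighted_imitation_loss`, and the `tool_weight` value are all assumptions.

```python
def weighted_imitation_loss(token_log_probs, is_tool_token, tool_weight=2.0):
    """Weighted negative log-likelihood over one trace (illustrative sketch).

    token_log_probs: log-probability the model assigns to each target token.
    is_tool_token:   True where the token is an operation or argument token.
    tool_weight:     hypothetical upweighting factor; not taken from the paper.
    """
    if len(token_log_probs) != len(is_tool_token):
        raise ValueError("inputs must have the same length")

    # Ordinary tokens get weight 1.0; operation/argument tokens get tool_weight.
    weights = [tool_weight if flag else 1.0 for flag in is_tool_token]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0

    # Normalize by the total weight so the loss scale stays comparable
    # across traces with different numbers of tool tokens.
    weighted_nll = sum(-lp * w for lp, w in zip(token_log_probs, weights))
    return weighted_nll / total_weight
```

With `tool_weight=1.0` this reduces to the plain mean negative log-likelihood. Raising the weight makes a poorly predicted operation or argument token contribute more to the loss than an equally poorly predicted ordinary token.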

Originally published on March 27, 2026. Curated by AI News.

Related Articles

[2506.22504] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection
Machine Learning

[2508.00307] Acoustic Imaging for Low-SNR UAV Detection: Dense Beamformed Energy Maps and U-Net SELD
Machine Learning

[2603.25524] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
Computer Vision

[2603.25170] Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling
Machine Learning