[2603.25091] Pixelis: Reasoning in Pixels, from Seeing to Acting
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.25091 (cs)
[Submitted on 26 Mar 2026]

Title: Pixelis: Reasoning in Pixels, from Seeing to Acting
Authors: Yunpeng Zhou

Abstract: Most vision-language systems are static observers: they describe pixels but do not act, and cannot safely improve under distribution shift. This passivity limits generalizable, physically grounded visual intelligence; learning through action, rather than static description, is essential for moving beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from the consequences of those operations. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces, using a masked imitation loss that upweights operation/argument tokens and auxiliary heads that stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective combining prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers alone, and updating toward short, high-fidelity exemplars...
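The phase-1 masked imitation loss described in the abstract can be sketched as a weighted negative log-likelihood in which operation/argument tokens receive a larger weight than ordinary reasoning tokens. The weighting factor, data layout, and function name below are illustrative assumptions, not details from the paper:

```python
import math

def masked_imitation_loss(log_probs, targets, is_action_token, action_weight=2.0):
    """Weighted negative log-likelihood over a token sequence.

    log_probs: list of dicts mapping token -> log-probability at each step
    targets: gold token at each step
    is_action_token: True where the target is an operation/argument token
    action_weight: upweighting factor for operation/argument tokens
    (the default of 2.0 is an assumption, not a value from the paper)
    """
    total, norm = 0.0, 0.0
    for lp, tgt, is_act in zip(log_probs, targets, is_action_token):
        w = action_weight if is_act else 1.0
        total += -w * lp[tgt]  # weighted cross-entropy contribution
        norm += w
    return total / norm  # normalize by total weight, not sequence length
```

Because the loss is normalized by total weight, upweighting action tokens shifts gradient pressure toward getting the tool grammar right without inflating the loss scale.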
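The phase-2 dual-drive objective can likewise be sketched as a scalar reward combining prediction-error curiosity, adjacent-step coherence, a mild penalty on trajectory length, and a KL anchor term. All weights and the coherence measure (cosine similarity of consecutive reasoning states) are assumptions for illustration:

```python
def dual_drive_reward(pred_errors, step_embeddings, num_steps,
                      kl_to_anchor, w_cur=1.0, w_coh=1.0,
                      w_eff=0.1, beta=0.05):
    """Illustrative combination of the phase-2 reward terms; the weights
    and the cosine-based coherence term are assumptions, not the paper's.

    pred_errors: per-step prediction errors of a forward model (curiosity)
    step_embeddings: per-step reasoning-state vectors (for coherence)
    num_steps: trajectory length (for the mild efficiency prior)
    kl_to_anchor: KL divergence from the frozen anchor policy
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    curiosity = sum(pred_errors) / len(pred_errors)
    # adjacent-step coherence: mean similarity of consecutive states
    coherence = sum(cos(u, v) for u, v in
                    zip(step_embeddings, step_embeddings[1:]))
    coherence /= (len(step_embeddings) - 1)
    efficiency = -num_steps  # mild prior toward short toolchains
    return (w_cur * curiosity + w_coh * coherence
            + w_eff * efficiency - beta * kl_to_anchor)
```

The KL term subtracts rather than adds, so drifting far from the anchor policy is penalized even when curiosity is high.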
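Finally, the phase-3 idea of voting over complete trajectories rather than answers alone can be illustrated with a simple majority vote over (toolchain, answer) pairs, breaking ties toward shorter toolchains. The tuple representation and tiebreak rule are simplifying assumptions:

```python
from collections import Counter

def vote_over_trajectories(trajectories):
    """Label-free selection by voting over complete trajectories
    (toolchain + answer), not answers alone; a simplified sketch of
    the phase-3 idea, with ties broken toward shorter toolchains.

    trajectories: list of (toolchain, answer) pairs, where toolchain
    is a tuple of operation names.
    """
    counts = Counter((chain, ans) for chain, ans in trajectories)
    # most-voted first; shorter toolchains win ties (efficiency prior)
    best = min(counts.items(), key=lambda kv: (-kv[1], len(kv[0][0])))
    return best[0]
```

Voting on the full trajectory distinguishes two samples that reach the same answer by different (and differently reliable) toolchains, which a plain answer vote would conflate.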