[2603.26737] Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

arXiv - AI March 31, 2026 3 min read

About this article

arXiv:2603.26737 (cs) [Submitted on 21 Mar 2026]

Title: Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

Authors: Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun

Abstract: Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end, using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.2673...
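The two steps named in the abstract (score regions by question relevance, then visit them from most to least salient) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the dot-product-plus-softmax scoring, and the hand-written feature vectors are all assumptions made for illustration, and the paper's learned saliency map and end-to-end training are not reproduced here.

```python
import math

def saliency_scores(patch_feats, question_feat):
    """Hypothetical saliency: softmax over patch/question dot products."""
    logits = [sum(p * q for p, q in zip(patch, question_feat))
              for patch in patch_feats]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ordered_regions(patch_feats, question_feat, k):
    """Return the indices of the k most salient patches, most salient first,
    together with the full list of saliency scores."""
    scores = saliency_scores(patch_feats, question_feat)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy inputs (assumed): four image patches and one question embedding.
patches = [[0.1, 0.0], [0.9, 0.8], [0.4, 0.5], [0.0, 0.2]]
question = [1.0, 1.0]

order, scores = ordered_regions(patches, question, k=3)
# A downstream reasoner would consume the patches in this order,
# primary cue first, then secondary cues.
print(order)
```

In this sketch the ordering is a plain sort over fixed scores; a model following the abstract's description would instead learn the saliency map jointly with the reasoning steps.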

Originally published on March 31, 2026. Curated by AI News.
