[2505.08548] From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Computer Science > Robotics
arXiv:2505.08548 (cs)
[Submitted on 13 May 2025 (v1), last revised 5 Apr 2026 (this version, v3)]

Title: From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Authors: Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao

Abstract: Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities...
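The abstract names a self-consistency mechanism over spatial coordinates without detailing it. Purely as a hedged illustration of what self-consistency over predicted coordinates commonly means in practice, the sketch below samples a 2D keypoint several times and keeps the consensus point. Everything here (the function name `aggregate_keypoints`, the greedy clustering, the pixel radius) is an assumption for illustration, not FSD's actual mechanism.

```python
import math

def aggregate_keypoints(candidates, radius=12.0):
    """Pick a consensus 2D keypoint from repeated model samples.

    candidates: list of (x, y) pixel coordinates proposed by the model.
    radius: two proposals within this many pixels count as agreeing.
    """
    # Greedy clustering: group proposals that fall within `radius`
    # of an existing cluster's centroid.
    clusters = []
    for pt in candidates:
        for cluster in clusters:
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            if math.hypot(pt[0] - cx, pt[1] - cy) <= radius:
                cluster.append(pt)
                break
        else:
            clusters.append([pt])
    # Consensus = centroid of the most-populated cluster.
    best = max(clusters, key=len)
    return (sum(p[0] for p in best) / len(best),
            sum(p[1] for p in best) / len(best))

# Example: five sampled predictions, four roughly agreeing.
samples = [(101, 214), (98, 210), (103, 216), (250, 40), (100, 212)]
print(aggregate_keypoints(samples))  # -> near (100.5, 213.0)
```

The design intuition is the same as self-consistency decoding in language models: outlier predictions rarely repeat, so agreement across samples is a cheap proxy for grounding the coordinate in the visual evidence.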