[2505.08548] From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

[2505.08548] From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

arXiv - AI 4 min read

About this article

Abstract page for arXiv paper 2505.08548: From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Computer Science > Robotics arXiv:2505.08548 (cs) [Submitted on 13 May 2025 (v1), last revised 5 Apr 2026 (this version, v3)] Title:From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation Authors:Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao View a PDF of the paper titled From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation, by Yifu Yuan and 9 other authors View PDF HTML (experimental) Abstract:Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilitie...

Originally published on April 07, 2026. Curated by AI News.

Related Articles

Llms

Things I got wrong building a confidence evaluator for local LLMs [D]

I've been building **Autodidact**, a local-first AI agent framework. The central piece is a **confidence evaluator** - something that dec...

Reddit - Machine Learning · 1 min ·
Llms

I’m convinced 90% of you building "AI Agents" are just burning money on proxy providers. [D]

Seriously, I just audited my stack and realized I’m spending more on rotating residential proxies than I am on the actual Claude and Open...

Reddit - Machine Learning · 1 min ·
Llms

How do you test AI agents in production? The unpredictability is overwhelming.[D]

I’ve been in QA for almost a decade. My mental model for quality was always: given input X, assert output Y. Now I’m on a team that’s shi...

Reddit - Machine Learning · 1 min ·
Llms

Confusing Website

i'm trying to find a video online and couldn't so i asked ChatGPT by describing the video and i was given a link and i'm trying to make s...

Reddit - Artificial Intelligence · 1 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime