[2602.20687] How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
Summary
This article discusses the limitations of current benchmarks for vision-language model (VLM)-driven embodied agents and introduces NativeEmbodied, a new benchmark that evaluates agents using a unified low-level action space.
Why It Matters
The research highlights critical gaps in the evaluation of VLM-based agents, emphasizing the need to assess foundational low-level skills alongside high-level task performance. By introducing NativeEmbodied, it sets a new standard for evaluating embodied intelligence, which is essential for advancing AI capabilities in real-world applications.
Key Takeaways
- Current benchmarks for VLM-driven agents often overlook low-level skills.
- NativeEmbodied introduces a unified low-level action space for evaluation.
- The study reveals significant deficiencies in fundamental embodied skills.
- Joint evaluation across task and skill levels provides deeper insights.
- Findings can guide future research and development in embodied AI.
Computer Science > Artificial Intelligence
arXiv:2602.20687 (cs)
[Submitted on 24 Feb 2026]
Title: How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
Authors: Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu
Abstract: Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodi...