[2511.23055] MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Summary
The paper presents MindPower, a robot-centric framework that adds Theory of Mind (ToM) reasoning to vision-language embodied agents. The authors report that it outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
Why It Matters
Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks consider only human mental states while ignoring the agent's own perspective. MindPower addresses this gap by having the agent infer both its own and others' mental states before it decides and acts. The paper also introduces Mind-Reward, an optimization objective that encourages VLMs to produce consistent ToM reasoning and behavior.
Key Takeaways
- MindPower integrates Theory of Mind reasoning into embodied agents.
- The framework improves decision-making and action generation by modeling self and others' mental states.
- Mind-Reward is an optimization objective that encourages consistent ToM reasoning and behavior.
- MindPower outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
- Unlike existing benchmarks, which focus solely on human mental states, MindPower also models the agent's own perspective.
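The abstract describes a staged flow: perceive the environment and human states, reason about both self and others, then generate a decision and actions. The following is a minimal sketch of that ordering only. Every name in it (`Percept`, `MentalState`, `perceive`, `reason_about_minds`, `decide`, `plan_actions`, `run_pipeline`) is hypothetical, and the string-based rules are toy stand-ins for the VLM. It is not the authors' implementation and does not model Mind-Reward.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Percept:
    """What the agent observes: the scene and the human's visible state."""
    scene: str
    human_state: str


@dataclass
class MentalState:
    """Belief and intention, two of the ToM components named in the abstract."""
    belief: str
    intention: str


def perceive(observation: dict) -> Percept:
    # Hypothetical stand-in for the perception stage of a VLM.
    return Percept(scene=observation["scene"], human_state=observation["human"])


def reason_about_minds(p: Percept) -> tuple[MentalState, MentalState]:
    # Toy rule standing in for learned ToM reasoning: build one mental
    # state for the human and one for the robot itself.
    human = MentalState(
        belief=f"the human is {p.human_state}",
        intention="continue the current activity",
    )
    robot = MentalState(
        belief=f"the robot sees that {p.scene}",
        intention="tell the human what it sees",
    )
    return human, robot


def decide(human: MentalState, robot: MentalState) -> str:
    # The decision is conditioned on both inferred mental states.
    return f"Since {human.belief} and {robot.belief}, {robot.intention}."


def plan_actions(decision: str) -> list[str]:
    # Turn the decision into a short, fixed action sequence (toy example).
    return ["approach_human", f"say({decision!r})"]


def run_pipeline(observation: dict) -> list[str]:
    percept = perceive(observation)             # 1. Perception
    human, robot = reason_about_minds(percept)  # 2. Mental reasoning
    decision = decide(human, robot)             # 3. Decision making
    return plan_actions(decision)               # 4. Action


if __name__ == "__main__":
    obs = {"scene": "the keys are on the shelf",
           "human": "searching the desk for keys"}
    for step in run_pipeline(obs):
        print(step)
```

Keeping the robot's own `MentalState` separate from the human's mirrors the paper's point that the agent's perspective is modeled alongside the human's; the decision step then reads both.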
Computer Science > Artificial Intelligence
arXiv:2511.23055 (cs)
[Submitted on 28 Nov 2025 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Authors: Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
Abstract: Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
Subjects: Art...