[2602.20219] An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
Summary
This article presents a novel multimodal framework for human-robot interaction that integrates video and speech processing with large language models, enhancing command execution accuracy and adaptability in robotic systems.
Why It Matters
As human-robot interaction becomes increasingly prevalent, developing systems that can accurately interpret human intent is crucial for seamless collaboration. This research contributes to the field by proposing a robust framework that combines advanced technologies, potentially paving the way for more intuitive robotic applications.
Key Takeaways
- The framework combines vision-language models, speech processing, and fuzzy logic for improved human-robot interaction.
- Experimental evaluations on consumer-grade hardware show a command execution accuracy of 75%, highlighting the system's robustness and adaptability.
- Florence-2 (object detection), Llama 3.1 (natural language understanding), and Whisper (speech recognition) together provide a spoken-command interface for object manipulation (see the pipeline sketch after this list).
- This research provides a flexible foundation for future advancements in human-robot collaboration.
- The approach addresses both scene perception and action planning, crucial for effective command interpretation.
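The paper itself does not publish source code, but the architecture it describes suggests a straightforward speech-to-action orchestration. The sketch below is a minimal, hypothetical wiring of that pipeline: only the Whisper call uses a real library API (openai-whisper); `parse_intent`, `detect_objects`, `execute`, and `handle_spoken_command` are illustrative stand-ins for the paper's Llama 3.1, Florence-2, and Dobot Magician components, whose actual prompts and interfaces are not given in the abstract.

```python
# Sketch of the speech -> language -> vision -> action pipeline described
# above. Only the Whisper call uses a real library API; the other helpers
# are hypothetical placeholders, not the authors' implementation.

import whisper


def transcribe_command(audio_path: str) -> str:
    """Speech recognition: spoken command -> text via Whisper."""
    model = whisper.load_model("base")  # model size is an assumption
    return model.transcribe(audio_path)["text"]


def parse_intent(utterance: str) -> dict:
    """Placeholder for Llama 3.1 natural-language understanding.
    A real implementation would prompt the LLM to emit structured output."""
    words = utterance.lower().rstrip(".").split()
    return {"action": words[0], "object": " ".join(words[-2:])}


def detect_objects(frame) -> list[dict]:
    """Placeholder for Florence-2 object detection on the camera frame."""
    return [{"label": "red cube", "score": 0.91, "box": (120, 80, 200, 160)}]


def execute(action: str, box: tuple) -> bool:
    """Placeholder for a Dobot Magician wrapper: map a pixel-space box
    to arm coordinates and run a motion primitive such as pick or place."""
    print(f"{action} at {box}")
    return True


def handle_spoken_command(audio_path: str, frame) -> bool:
    utterance = transcribe_command(audio_path)       # e.g. "pick up the red cube"
    intent = parse_intent(utterance)                 # {"action": "pick", "object": "red cube"}
    detections = detect_objects(frame)
    # Ground the requested object in the current scene.
    matches = [d for d in detections if intent["object"] in d["label"]]
    if not matches:
        return False  # object not visible: ask the user to clarify
    target = max(matches, key=lambda d: d["score"])  # most confident match
    return execute(intent["action"], target["box"])
```

Keeping perception (`detect_objects`) and language understanding (`parse_intent`) as separate stages, joined only at the matching step, reflects the paper's emphasis on addressing scene perception and action planning jointly when interpreting a command.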
Computer Science > Robotics
arXiv:2602.20219 (cs)
[Submitted on 23 Feb 2026]

Title: An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
Authors: Guanting Shen, Zi Tian

Abstract: Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%, highlighting both the robustness and adaptability of the system. Beyond its current performance, the proposed architecture serves as a flexible and extensible foundation for fut...
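The abstract credits fuzzy logic with enabling "precise and adaptive control" of the arm but does not specify the rule base. One plausible reading, sketched below purely as an assumption, is a fuzzy controller that modulates arm speed by distance to the target; the membership functions, rule set, and output speeds here are all hypothetical.

```python
# Hypothetical fuzzy-logic speed control, assuming the paper's fuzzy
# component modulates arm speed by distance to target. Not the authors'
# actual rules, which are not described in the abstract.

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def fuzzy_speed(distance_mm: float) -> float:
    """Two fuzzy rules: IF distance is NEAR THEN speed is SLOW;
    IF distance is FAR THEN speed is FAST."""
    near = tri(distance_mm, -60.0, 0.0, 60.0)          # full membership at the target
    far = min(tri(distance_mm, 20.0, 120.0, 1e9), 1.0)  # saturates when far away
    slow, fast = 10.0, 80.0                            # assumed output speeds, mm/s
    if near + far == 0.0:
        return fast
    # Weighted-average defuzzification over the two rule outputs.
    return (near * slow + far * fast) / (near + far)


print(fuzzy_speed(10.0))   # near the target -> slow, precise motion
print(fuzzy_speed(200.0))  # far from the target -> fast approach
```

Smoothly interpolating between slow and fast, rather than switching at a hard threshold, is what makes fuzzy control attractive for precise positioning tasks like pick-and-place.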