[2602.20219] An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction

arXiv - AI · 3 min read

Summary

This article presents a novel multimodal framework for human-robot interaction that integrates video and speech processing with large language models, enhancing command execution accuracy and adaptability in robotic systems.

Why It Matters

As human-robot interaction becomes increasingly prevalent, developing systems that can accurately interpret human intent is crucial for seamless collaboration. This research contributes to the field by proposing a robust framework that combines vision-language models, speech recognition, and fuzzy control, potentially paving the way for more intuitive robotic applications.

Key Takeaways

  • The framework combines vision-language models, speech processing, and fuzzy logic for improved human-robot interaction (a pipeline sketch follows this list).
  • Experimental results show a command execution accuracy of 75% on consumer-grade hardware.
  • The integration of technologies like Florence-2, Llama 3.1, and Whisper enhances the interface for object manipulation.
  • This research provides a flexible foundation for future advancements in human-robot collaboration.
  • The approach addresses both scene perception and action planning, crucial for effective command interpretation.
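
The takeaways above name three off-the-shelf models; the sketch below shows one plausible way to wire them into a single voice-command pipeline, using the public openai-whisper, ollama, and Hugging Face transformers APIs. The file names, the JSON-extraction prompt, and the hand-off to the arm are hypothetical stand-ins: the paper's actual integration code is not reproduced in this summary.

```python
# Minimal sketch of the speech -> language -> vision pipeline (assumed
# wiring; the paper's own glue code is not shown in this summary).
import json

import whisper                    # openai-whisper: speech recognition
import ollama                     # client for a local Llama 3.1 server
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# 1. Speech -> text with Whisper.
asr = whisper.load_model("base")
command = asr.transcribe("command.wav")["text"]   # e.g. "pick up the red cube"

# 2. Text -> structured intent with Llama 3.1 (served via Ollama here;
#    the prompt and JSON schema are illustrative, not the authors').
prompt = ('Return JSON with keys "action" and "object" for this robot '
          f'command: "{command}"')
reply = ollama.chat(model="llama3.1",
                    messages=[{"role": "user", "content": prompt}])
intent = json.loads(reply["message"]["content"])  # may need retry/validation

# 3. Scene perception: ground the requested object with Florence-2's
#    object-detection task prompt "<OD>".
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base",
                                          trust_remote_code=True)
vlm = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base",
                                           trust_remote_code=True)
image = Image.open("workspace.jpg")
inputs = processor(text="<OD>", images=image, return_tensors="pt")
ids = vlm.generate(input_ids=inputs["input_ids"],
                   pixel_values=inputs["pixel_values"], max_new_tokens=512)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
dets = processor.post_process_generation(raw, task="<OD>",
                                         image_size=(image.width, image.height))

# 4. Match a detection to the intent; camera-to-arm calibration and the
#    Dobot Magician SDK call are out of scope for this sketch.
for box, label in zip(dets["<OD>"]["bboxes"], dets["<OD>"]["labels"]):
    if intent["object"].lower() in label.lower():
        print(f"{intent['action']}: {label} at bbox {box}")
        break
```

A production version would validate the LLM's JSON output and convert the pixel-space bounding box into arm coordinates through a camera calibration step before issuing motion commands.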

Computer Science > Robotics
arXiv:2602.20219 (cs) · Submitted on 23 Feb 2026
Title: An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
Authors: Guanting Shen, Zi Tian

Abstract: Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%, highlighting both the robustness and adaptability of the system. Beyond its current performance, the proposed architecture serves as a flexible and extensible foundation for fut...
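
The abstract credits fuzzy logic with making the control adaptive but does not spell out a rule base, so the following is only a plausible illustration: a tiny zero-order Sugeno-style controller in which the vision model's detection confidence sets the arm's approach speed. Every membership breakpoint and output speed here is invented for the example.

```python
# Illustrative Sugeno-style fuzzy controller: detection confidence -> arm
# approach speed. Breakpoints and speeds are made-up values, not the paper's.

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def approach_speed(confidence: float) -> float:
    """Defuzzified approach speed in mm/s for confidence in [0, 1]."""
    # Firing strengths of three fuzzy sets over the confidence universe.
    low = max(0.0, min(1.0, (0.5 - confidence) / 0.5))   # shoulder at 0
    med = tri(confidence, 0.2, 0.5, 0.8)
    high = max(0.0, min(1.0, (confidence - 0.5) / 0.5))  # shoulder at 1
    # Singleton outputs: creep when unsure, move briskly when certain.
    weights, speeds = [low, med, high], [20.0, 60.0, 100.0]
    return sum(w * s for w, s in zip(weights, speeds)) / sum(weights)

if __name__ == "__main__":
    for c in (0.3, 0.6, 0.95):
        print(f"confidence {c:.2f} -> {approach_speed(c):.0f} mm/s")
```

The weighted average of singleton consequents keeps the output continuous as confidence varies, which is the practical appeal of a fuzzy layer between a noisy perception stage and a physical actuator.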

Related Articles

[2603.18940] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Abstract page for arXiv paper 2603.18940: Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty ...

arXiv - Machine Learning · 3 min
[2511.10876] Architecting software monitors for control-flow anomaly detection through large language models and conformance checking

Abstract page for arXiv paper 2511.10876: Architecting software monitors for control-flow anomaly detection through large language models...

arXiv - Machine Learning · 4 min
[2512.02425] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Abstract page for arXiv paper 2512.02425: WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

arXiv - Machine Learning · 4 min
[2511.00810] GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

Abstract page for arXiv paper 2511.00810: GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

arXiv - Machine Learning · 4 min