[2604.03631] Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
Computer Science > Artificial Intelligence

arXiv:2604.03631 (cs) [Submitted on 4 Apr 2026]

Title: Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
Authors: Likai Peng, Shihui Feng

Abstract: On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of leading closed-source VLMs (Claude-3.7-Sonnet and GPT-4.1) and an open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts, based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification; 2) an autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-l...
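The three-agent workflow MAS described in the abstract maps onto a segment → detect → verify pipeline. Below is a minimal Python sketch of that structure, assuming a generic hosted VLM chat endpoint; every name here (Scene, call_vlm, segment_scenes, detect_behavior, verify) and the fixed-window segmentation are hypothetical illustrations of the general idea, not the authors' implementation, prompts, or scene-detection method.

```python
"""Minimal sketch of a three-agent segment -> detect -> verify pipeline.

All names and the call_vlm interface are hypothetical; the paper's
actual pipeline, prompts, and segmentation method may differ.
"""

from dataclasses import dataclass

# ICAP framework labels (Interactive, Constructive, Active, Passive).
ICAP_LABELS = ["Interactive", "Constructive", "Active", "Passive"]


@dataclass
class Scene:
    start_s: float      # segment start time (seconds)
    end_s: float        # segment end time (seconds)
    frames: list        # sampled frames for this scene
    cursor_trace: list  # (t, x, y) cursor positions, if available


def call_vlm(prompt: str, frames: list) -> str:
    """Placeholder for a VLM call (e.g. a Claude, GPT, or Qwen endpoint)."""
    raise NotImplementedError


# Agent 1: scene segmentation. Sketched here as a fixed-window splitter;
# the paper segments "by scene", whose detection method is not given here.
def segment_scenes(video_frames, window_s=30.0, fps=1.0):
    scenes, step = [], int(window_s * fps)
    for i in range(0, len(video_frames), step):
        scenes.append(Scene(i / fps, (i + step) / fps,
                            video_frames[i:i + step], cursor_trace=[]))
    return scenes


# Agent 2: cursor-informed behavior detection.
def detect_behavior(scene: Scene) -> dict:
    prompt = (
        "You see frames from a screen recording of collaborative learning. "
        f"Cursor trace: {scene.cursor_trace}. "
        f"Classify the on-screen behavior into one of {ICAP_LABELS} and "
        "cite the visual evidence (windows, edits, cursor activity) used."
    )
    return {"scene": scene, "raw": call_vlm(prompt, scene.frames)}


# Agent 3: evidence-based verification of the proposed label.
def verify(candidate: dict) -> dict:
    prompt = (
        "Given this proposed ICAP label and its cited evidence:\n"
        f"{candidate['raw']}\n"
        "Check the evidence against the frames. Answer ACCEPT, or REVISE "
        "with a corrected label."
    )
    candidate["verdict"] = call_vlm(prompt, candidate["scene"].frames)
    return candidate


def code_recording(video_frames):
    """Run the full segment -> detect -> verify pipeline."""
    return [verify(detect_behavior(s)) for s in segment_scenes(video_frames)]
```

The ReAct-inspired autonomous-decision MAS would instead replace this fixed pipeline with an iterative loop in which the model reasons about the current state and decides which tool to invoke next, rather than executing the three agents in a predetermined order.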