[2602.18466] Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos
Summary
This article evaluates the effectiveness of multimodal large language models (LLMs) in analyzing K-12 science classroom discourse, revealing significant limitations in pedagogical reasoning.
Why It Matters
Understanding how well AI can analyze complex classroom interactions is crucial for improving educational technologies. By exposing where current models fail to grasp pedagogical nuance, this research informs the design of effective AI tools for education.
Key Takeaways
- The study introduces SciIBI, the first video benchmark for analyzing science classroom discourse, comprising 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels.
- Current multimodal LLMs struggle with distinguishing pedagogical practices, indicating a need for deeper instructional reasoning.
- Video input does not consistently enhance model performance, suggesting limitations in current architectures.
- Models often rely on surface-level patterns rather than true understanding of pedagogy.
- The findings advocate for human-AI collaboration in educational contexts.
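The paper evaluates models by how well their CIP codes match expert annotations. As a minimal sketch of this kind of scoring, the snippet below computes overall and per-label accuracy of predicted CIP codes against gold labels; the label names and evaluation details are hypothetical illustrations, not the paper's actual taxonomy or protocol.

```python
from collections import defaultdict

def cip_accuracy(gold, pred):
    """Overall and per-label accuracy for CIP code predictions.

    gold, pred: equal-length lists of CIP label strings, one per clip.
    """
    assert len(gold) == len(pred)
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    per_label = {lab: correct[lab] / total[lab] for lab in total}
    overall = sum(correct.values()) / len(gold)
    return overall, per_label

# Hypothetical CIP labels for illustration only.
gold = ["eliciting", "eliciting", "modeling", "explaining"]
pred = ["eliciting", "modeling", "modeling", "explaining"]
overall, per_label = cip_accuracy(gold, pred)
# overall = 0.75; per_label["eliciting"] = 0.5
```

Per-label breakdowns matter here because the paper's central finding is that models confuse pedagogically similar practices, a pattern that an overall accuracy number alone would hide.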
Computer Science > Computers and Society
arXiv:2602.18466 (cs)
[Submitted on 8 Feb 2026]
Authors: Yixuan Shen, Peng He, Honglu Liu, Yuyang Ji, Tingting Li, Tianlong Chen, Kaidi Xu, Feng Liu
Abstract: K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains...