[2602.18466] Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

arXiv - AI

Summary

This paper evaluates how effectively multimodal large language models (LLMs) analyze K-12 science classroom discourse, revealing significant limitations in their pedagogical reasoning.

Why It Matters

Understanding how well AI can analyze complex classroom interactions is crucial for improving educational technologies. This research shows where current models fail to grasp pedagogical nuance, which is essential to know before building effective AI tools for science education.

Key Takeaways

  • The study introduces SciIBI, the first video benchmark for analyzing science classroom discourse, with 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels.
  • Current multimodal LLMs struggle with distinguishing pedagogical practices, indicating a need for deeper instructional reasoning.
  • Video input does not consistently enhance model performance, suggesting limitations in current architectures.
  • Models often rely on surface-level patterns rather than true understanding of pedagogy.
  • The findings advocate for human-AI collaboration in educational contexts.

Computer Science > Computers and Society
arXiv:2602.18466 (cs) · Submitted on 8 Feb 2026

Title: Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos
Authors: Yixuan Shen, Peng He, Honglu Liu, Yuyang Ji, Tingting Li, Tianlong Chen, Kaidi Xu, Feng Liu

Abstract: K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains...
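The evaluation the abstract describes is essentially a clip-level classification task: each clip carries a gold Core Instructional Practice (CIP) code, and a model's outputs are compared against those codes. As a rough, purely illustrative sketch of how such scoring might be computed, the Python snippet below reports overall accuracy and the confusions between practices; the CIP label names, clip IDs, and data layout are hypothetical, since the paper's actual annotation scheme and protocol are not detailed in this summary.

    # Hypothetical sketch of scoring clip-level CIP predictions against gold labels.
    # Label names, clip IDs, and data layout are illustrative, not the paper's format.
    from collections import Counter

    gold = {
        "clip_001": "eliciting_student_ideas",
        "clip_002": "pressing_for_evidence",
        "clip_003": "connecting_to_models",
        "clip_004": "pressing_for_evidence",
    }
    pred = {
        "clip_001": "eliciting_student_ideas",
        "clip_002": "eliciting_student_ideas",  # conflates two similar practices
        "clip_003": "connecting_to_models",
        "clip_004": "pressing_for_evidence",
    }

    # Overall accuracy across clips.
    correct = sum(pred[c] == gold[c] for c in gold)
    print(f"accuracy: {correct / len(gold):.2f}")

    # Confusion counts: which gold practice is mistaken for which predicted one.
    # This view surfaces confusions between pedagogically similar practices,
    # the failure mode the paper highlights.
    confusion = Counter((gold[c], pred[c]) for c in gold if pred[c] != gold[c])
    for (true_label, predicted), count in confusion.items():
        print(f"{true_label} -> {predicted}: {count}")

The real benchmark additionally annotates sophistication levels and compares transcript-only against video-augmented inputs, neither of which this toy scoring captures.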
