[2602.18540] Rodent-Bench
Summary
Rodent-Bench introduces a benchmark for evaluating Multimodal Large Language Models (MLLMs) in annotating rodent behavior videos, revealing significant performance limitations.
Why It Matters
This benchmark is crucial for advancing automated behavioral annotation in neuroscience, highlighting the current shortcomings of MLLMs in handling complex video data. It sets a foundation for future improvements in model development and application in scientific research.
Key Takeaways
- Rodent-Bench evaluates MLLMs on their ability to annotate rodent behavior footage.
- Current state-of-the-art models struggle with tasks like temporal segmentation and subtle behavior distinction.
- The benchmark includes diverse datasets and standardized metrics for comprehensive evaluation.
- Models showed modest performance on certain datasets, notably grooming detection, but overall results indicate significant challenges.
- Insights from this study can guide future developments in automated behavioral annotation.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18540 (cs) [Submitted on 20 Feb 2026]
Title: Rodent-Bench
Authors: Thomas Heap, Laurence Aitchison, Emma Cahill, Adriana Casado Rodriguez
Abstract: We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioural paradigms, including social interactions, grooming, scratching, and freezing, with videos ranging from 10 to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and the Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioural states. Our analysis identifies key limitations in current MLLMs for scientific video annotation…
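To make the evaluation metrics concrete, the sketch below shows how two of the metrics named in the abstract, second-wise accuracy and macro F1, could be computed over per-second behaviour labels. This is a minimal illustration with hypothetical labels, not the benchmark's actual evaluation code; the function names and example data are assumptions.

```python
def second_wise_accuracy(gold, pred):
    """Fraction of seconds where the predicted label matches the annotation."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare behaviours count equally."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical per-second labels for a six-second clip
gold = ["groom", "groom", "rest", "rest", "scratch", "rest"]
pred = ["groom", "rest",  "rest", "rest", "scratch", "rest"]

print(round(second_wise_accuracy(gold, pred), 3))  # 0.833 (5 of 6 seconds correct)
print(round(macro_f1(gold, pred), 3))              # 0.841
```

Macro F1 matters here because behaviours such as scratching or freezing may occupy only a small fraction of a long video; second-wise accuracy alone would reward a model that predicts the dominant background behaviour everywhere.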