[2602.18540] Rodent-Bench


Summary

Rodent-Bench introduces a benchmark for evaluating Multimodal Large Language Models (MLLMs) in annotating rodent behavior videos, revealing significant performance limitations.

Why It Matters

This benchmark is crucial for advancing automated behavioral annotation in neuroscience, highlighting the current shortcomings of MLLMs in handling complex video data. It sets a foundation for future improvements in model development and application in scientific research.

Key Takeaways

  • Rodent-Bench evaluates MLLMs on their ability to annotate rodent behavior footage.
  • Current state-of-the-art models struggle with tasks like temporal segmentation and subtle behavior distinction.
  • The benchmark includes diverse datasets and standardized metrics for comprehensive evaluation.
  • Models showed modest performance on specific tasks such as grooming detection, but overall results indicate significant challenges.
  • Insights from this study can guide future developments in automated behavioral annotation.

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18540 (cs) [Submitted on 20 Feb 2026]

Title: Rodent-Bench
Authors: Thomas Heap, Laurence Aitchison, Emma Cahill, Adriana Casado Rodriguez

Abstract: We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and the Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation...
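To make the benchmark's metrics concrete, here is a minimal sketch of how second-wise accuracy, macro F1, and the Matthews correlation coefficient could be computed for a per-second binary annotation task (e.g. grooming vs. other). The label sequences below are hypothetical illustrations, not data from the paper, and the paper's exact evaluation code may differ.

```python
# Hypothetical per-second behaviour labels (1 = "grooming", 0 = "other").
# These example sequences are illustrative only, not taken from Rodent-Bench.
gold = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]

# Confusion counts over all annotated seconds.
tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))

# Second-wise accuracy: fraction of seconds labelled correctly.
accuracy = (tp + tn) / len(gold)

def f1(tp_, fp_, fn_):
    """F1 score for one class from its confusion counts."""
    denom = 2 * tp_ + fp_ + fn_
    return 2 * tp_ / denom if denom else 0.0

# Macro F1: unweighted mean of per-class F1 (class 1, then class 0,
# where class 0's "positives" are the true negatives above).
macro_f1 = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2

# Matthews correlation coefficient for the binary case.
denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
mcc = (tp * tn - fp * fn) / denom if denom else 0.0

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} mcc={mcc:.2f}")
```

Aggregate accuracy alone can look deceptively high when behaviours are rare (most seconds are "other"), which is why the benchmark also reports class-balanced measures like macro F1 and MCC.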

