[2602.00288] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

arXiv - AI · 4 min read

Summary

The paper presents TimeBlind, a benchmark designed to evaluate the spatio-temporal understanding of video Large Language Models (LLMs), highlighting their limitations in temporal reasoning compared to human performance.

Why It Matters

As video content becomes increasingly prevalent, understanding temporal dynamics is crucial for AI systems. TimeBlind offers a structured approach to assess and improve video reasoning capabilities in LLMs, addressing a significant gap in current AI benchmarks.

Key Takeaways

  • TimeBlind categorizes temporal understanding into three levels: atomic events, event properties, and interdependencies.
  • Current MLLMs reach only 48.2% instance accuracy at distinguishing temporal dynamics, far below the 98.2% achieved by humans.
  • The benchmark utilizes a minimal-pairs paradigm to isolate temporal reasoning from static visual cues.
  • TimeBlind serves as a diagnostic tool to guide the development of future video understanding models.
  • The dataset and code are publicly available, promoting further research in this area.
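The strict pairing behind these numbers can be sketched in a few lines. The snippet below is a hypothetical illustration of how "instance accuracy" might be scored under the minimal-pairs paradigm described above (an instance counts only if the model answers correctly for every video-question pair built from it); the function and data layout are assumptions, not the paper's actual code.

```python
# Hypothetical sketch of minimal-pairs "instance accuracy" scoring.
# Data layout is an assumption: each instance is a list of
# (predicted, gold) answer tuples, one per video-question pair.

def instance_accuracy(results):
    """An instance is solved only if ALL of its video-question
    pairs are answered correctly; partial credit is not given."""
    solved = sum(
        all(pred == gold for pred, gold in instance)
        for instance in results
    )
    return solved / len(results)

# Two toy instances: the first is solved on both videos,
# the second fails on one of them.
demo = [
    [("A", "A"), ("B", "B")],   # both pairs correct -> counts
    [("A", "A"), ("A", "B")],   # one pair wrong -> does not count
]
print(instance_accuracy(demo))  # -> 0.5
```

This all-or-nothing criterion is what makes the metric diagnostic: a model that relies on static cues alone will answer both videos in a pair identically and fail the instance.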

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.00288 (cs) [Submitted on 30 Jan 2026 (v1), last revised 19 Feb 2026 (this version, v2)]

Title: TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

Authors: Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius

Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2,400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both vide...
