[2602.00288] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

arXiv - AI · 4 min read

Summary

The paper presents TimeBlind, a benchmark designed to evaluate the spatio-temporal understanding of video Large Language Models (LLMs), highlighting their limitations in temporal reasoning compared to human performance.

Why It Matters

As video content becomes increasingly prevalent, understanding temporal dynamics is crucial for AI systems. TimeBlind offers a structured approach to assess and improve video reasoning capabilities in LLMs, addressing a significant gap in current AI benchmarks.

Key Takeaways

  • TimeBlind categorizes temporal understanding into three levels: atomic events, event properties, and interdependencies.
  • Current MLLMs reach only 48.2% instance accuracy at distinguishing temporal dynamics, far below the 98.2% achieved by humans.
  • The benchmark utilizes a minimal-pairs paradigm to isolate temporal reasoning from static visual cues.
  • TimeBlind serves as a diagnostic tool to guide the development of future video understanding models.
  • The dataset and code are publicly available, promoting further research in this area.
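The strict pairing behind these numbers can be sketched in a few lines. The snippet below is a hypothetical illustration of how "instance accuracy" might be scored under the minimal-pairs paradigm described above (an instance counts only if the model answers correctly for every video-question pair built from it); the function and data layout are assumptions, not the paper's actual code.

```python
# Hypothetical sketch of minimal-pairs "instance accuracy" scoring.
# Data layout is an assumption: each instance is a list of
# (predicted, gold) answer tuples, one per video-question pair.

def instance_accuracy(results):
    """An instance is solved only if ALL of its video-question
    pairs are answered correctly; partial credit is not given."""
    solved = sum(
        all(pred == gold for pred, gold in instance)
        for instance in results
    )
    return solved / len(results)

# Two toy instances: the first is solved on both videos,
# the second fails on one of them.
demo = [
    [("A", "A"), ("B", "B")],   # both pairs correct -> counts
    [("A", "A"), ("A", "B")],   # one pair wrong -> does not count
]
print(instance_accuracy(demo))  # -> 0.5
```

This all-or-nothing criterion is what makes the metric diagnostic: a model that relies on static cues alone will answer both videos in a pair identically and fail the instance.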

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.00288 (cs) [Submitted on 30 Jan 2026 (v1), last revised 19 Feb 2026 (this version, v2)]

Title: TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

Authors: Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius

Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2,400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both vide...
