[2603.11698] OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.11698 (cs)

[Submitted on 12 Mar 2026 (v1), last revised 17 Apr 2026 (this version, v2)]

Title: OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Authors: Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen

Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language models.
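As a rough illustration of the scenario taxonomy the abstract describes, here is a minimal sketch of how action-object prompts might be bucketed into regular, novel, and compositional cases. The pair lists, classification rules, and names below are assumptions for illustration only, not the paper's construction pipeline:

```python
# Hypothetical sketch: bucketing action-object prompts into OSCBench-style
# scenario types. SEEN_PAIRS and the rules below are illustrative assumptions.

from dataclasses import dataclass

# Action-object pairs assumed to appear in the instructional cooking data.
SEEN_PAIRS = {("peel", "potato"), ("slice", "lemon"), ("chop", "onion")}

@dataclass
class Prompt:
    steps: list[tuple[str, str]]  # ordered (action, object) pairs in the prompt

def scenario_type(prompt: Prompt) -> str:
    """Classify a prompt as regular, novel, or compositional."""
    if len(prompt.steps) > 1:
        # Multiple state changes chained in one prompt,
        # e.g. "peel a potato, then slice it".
        return "compositional"
    # Single state change: in-distribution pairing vs. an unseen combination.
    return "regular" if prompt.steps[0] in SEEN_PAIRS else "novel"

print(scenario_type(Prompt([("peel", "potato")])))    # regular
print(scenario_type(Prompt([("peel", "lemon")])))     # novel
print(scenario_type(Prompt([("peel", "potato"),
                            ("slice", "potato")])))   # compositional
```

Under this reading, "regular" probes in-distribution performance while "novel" and "compositional" probe generalization to unseen pairings and to chained state changes, respectively.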