[2602.21779] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
Summary
This paper introduces a forensic benchmark for evaluating video deepfake reasoning in vision-language models, focusing on temporal inconsistencies rather than just spatial artifacts.
Why It Matters
As deepfake generation improves, detection methods that rely only on static, per-frame artifacts are becoming inadequate. This work addresses the need for models that can analyze dynamic inconsistencies across video frames. The benchmark gives a way to measure that ability in vision-language models, and its companion instruction-tuning set, FAQ-IT, is intended to improve it, which makes the work relevant to researchers and practitioners in AI safety and computer vision.
Key Takeaways
- Current models excel at detecting spatial artifacts but struggle with temporal inconsistencies in videos.
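- The Forensic Answer-Questioning (FAQ) benchmark formulates temporal deepfake analysis as a multiple-choice task organized in a three-level hierarchy: Facial Perception, Temporal Deepfake Grounding, and Forensic Reasoning.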
- Fine-tuning on the FAQ-IT instruction-tuning set significantly improves model performance on deepfake detection tasks.
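To make the evaluation setup concrete, here is a minimal sketch of how per-level accuracy could be computed on a multiple-choice benchmark of this kind. The three level names come from the paper's abstract; everything else (the `MCItem` schema, the field names, and the `accuracy_by_level` helper) is a hypothetical illustration and not the authors' released format or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Level names follow the abstract's three-level hierarchy.
# The identifiers themselves are assumptions made for this sketch.
LEVELS = ("facial_perception", "temporal_grounding", "forensic_reasoning")


@dataclass(frozen=True)
class MCItem:
    """One multiple-choice question (hypothetical schema)."""
    level: str       # one of LEVELS
    question: str
    options: tuple   # candidate answers shown to the model
    answer: int      # index of the correct option


def accuracy_by_level(items, predictions):
    """Return {level: accuracy} for every level that has at least one item."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.level] += 1
        correct[item.level] += int(pred == item.answer)
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Toy data, invented purely for illustration.
toy_items = [
    MCItem("facial_perception", "Is there a visible artifact?", ("yes", "no"), 0),
    MCItem("temporal_grounding", "Which frames flicker?", ("0-10", "10-20", "20-30"), 1),
    MCItem("temporal_grounding", "Where does the lip sync break?", ("start", "middle", "end"), 2),
    MCItem("forensic_reasoning", "Is the video real or fake?", ("real", "fake"), 0),
]
toy_predictions = [0, 1, 0, 1]
print(accuracy_by_level(toy_items, toy_predictions))
```

Reporting accuracy per level rather than as one pooled number keeps a model's strength on static artifacts from masking weakness on temporal grounding, which is the gap the benchmark is designed to expose.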
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.21779 (cs)
[Submitted on 25 Feb 2026]
Title: Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
Authors: Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang, Xuelong Li
Abstract: Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-t...