[2602.18880] FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model
Summary
The paper presents FOCA, a multimodal large language model-based framework for image forgery detection and localization. It fuses RGB spatial and frequency-domain features through a cross-attention module and produces human-interpretable explanations of its decisions.
Why It Matters
As image tampering techniques evolve, particularly those based on generative models, verifying the integrity of digital media becomes critical for public trust, digital forensics, and media verification. FOCA targets two limitations of existing detectors: over-reliance on semantic content at the expense of textural cues, and limited interpretability of subtle low-level tampering traces.
Key Takeaways
- FOCA integrates RGB spatial and frequency domain features for improved forgery detection.
- The framework provides explicit, human-interpretable cross-domain explanations of tampering traces.
- FSE-Set, a new large-scale dataset, contains diverse authentic and tampered images with pixel-level masks and dual-domain annotations.
- FOCA outperforms existing methods in both detection performance and interpretability.
- The work argues that combining spatial and frequency-domain cues counters the tendency of existing image forgery detection and localization methods to rely on semantic content alone.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18880 (cs)
[Submitted on 21 Feb 2026]
Title: FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model
Authors: Zhou Liu, Tonghua Su, Hongshi Zhang, Fuxiang Yang, Donglin Di, Yang Song, Lei Fan
Abstract: Advances in image tampering techniques, particularly generative models, pose significant challenges to media verification, digital forensics, and public trust. Existing image forgery detection and localization (IFDL) methods suffer from two key limitations: over-reliance on semantic content while neglecting textural cues, and limited interpretability of subtle low-level tampering traces. To address these issues, we propose FOCA, a multimodal large language model-based framework that integrates discriminative features from both the RGB spatial and frequency domains via a cross-attention fusion module. This design enables accurate forgery detection and localization while providing explicit, human-interpretable cross-domain explanations. We further introduce FSE-Set, a large-scale dataset with diverse authentic and tampered images, pixel-level masks, and dual-domain annotations. Extensive experiments show that FOCA outperforms state-o...
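The abstract names a cross-attention fusion module over RGB spatial and frequency-domain features but gives no architectural detail in this excerpt. The sketch below is an illustration of the general idea only, not the authors' implementation. It assumes that the frequency representation is a per-patch log-magnitude 2D FFT, that spatial tokens act as queries while frequency tokens supply keys and values, and that the projection weights are random placeholders. All function names, the patch size, and the dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def patchify(image, patch):
    """Split a 2D grayscale image into non-overlapping patch x patch blocks."""
    h, w = image.shape
    return [
        image[i:i + patch, j:j + patch]
        for i in range(0, h - patch + 1, patch)
        for j in range(0, w - patch + 1, patch)
    ]


def spatial_tokens(image, patch):
    """Assumed spatial tokens: flattened raw pixel patches."""
    return np.stack([block.ravel() for block in patchify(image, patch)])


def frequency_tokens(image, patch):
    """Assumed frequency tokens: flattened log-magnitude FFT of each patch."""
    return np.stack([
        np.log1p(np.abs(np.fft.fft2(block))).ravel()
        for block in patchify(image, patch)
    ])


def cross_attention(queries, context, w_q, w_k, w_v):
    """Scaled dot-product cross-attention: queries attend over context."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
image = rng.random((32, 32))  # stand-in for one image channel
patch = 8                     # hypothetical patch size
d_model = 32                  # hypothetical projection width

spat = spatial_tokens(image, patch)    # shape (16, 64)
freq = frequency_tokens(image, patch)  # shape (16, 64)

# Random placeholder projections; a trained model would learn these.
w_q, w_k, w_v = (rng.standard_normal((patch * patch, d_model)) * 0.1
                 for _ in range(3))

fused, attn = cross_attention(spat, freq, w_q, w_k, w_v)
print(fused.shape, attn.shape)  # (16, 32) (16, 16)
```

Each row of `attn` shows how strongly one spatial patch draws on each frequency patch. Weights of this kind are one plausible source for cross-domain explanations, though the excerpt does not say how FOCA derives its own.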