ICLR2025

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, Weijia Li

Abstract

On the LOKI benchmark, we evaluated 22 open-source LMMs, 6 advanced proprietary LMMs, and several expert synthetic detection models. Our key findings are summarized as follows:

For synthetic data detection tasks, we find: (1) LMMs exhibit moderate capabilities in synthetic data detection, with some degree of explainability and generalization, but there is still a gap compared to human performance; (2) compared to expert synthetic detection models, LMMs exhibit greater explainability and, compared to humans, can detect features invisible to the naked eye, which suggests promising prospects for further development.

For LMM capabilities, we find: (1) most LMMs exhibit model bias, tending to favor either synthetic or real data in their responses; (2) LMMs lack expert domain knowledge, performing poorly on specialized image types such as satellite and medical images; (3) current LMMs show unbalanced multimodal capabilities, excelling in image and text tasks but underperforming in 3D and audio tasks; (4) chain-of-thought prompting enhances LMMs' performance in synthetic data detection, whereas simple few-shot prompting falls short of providing the necessary reasoning support.

These findings highlight the challenging and comprehensive nature of the LOKI task and the promising future of LMMs in synthetic data detection.

2 RELATED WORK

2.1 SYNTHETIC DATA DETECTION

Synthetic data detection has garnered widespread attention as a means of preventing the misuse of synthetic multimedia data (Gragnaniello et al., 2021; Hou et al., 2023). The detection of synthetic images and audio has long been a popular research topic (Barni et al., 2020; Frank et al., 2020), while methods for synthetic video detection have emerged only recently, such as DuB3D (Ji et al., 2024) and AIGVDet (Bai et al., 2024a). However, most work focuses primarily on the binary distinction between authentic and synthetic data, resulting in poor interpretability.
Some studies aim to enhance the interpretability of synthetic detection by providing latent representations (Dong et al., 2022), feature explanations (Chai et al., 2020), and artifact localization (Zhang et al., 2023a; Shao et al., 2023; 2024); however, most of this research remains limited to interpretability in terms of abstract symbols, leaving a significant gap in alignment with human understanding. In practice, current AI-generated synthetic data still exhibits noticeable flaws, such as discontinuities in synthetic videos and insufficient geometric accuracy in 3D data. These shortcomings can be effectively captured and perceived by human users (Tariang et al., 2024), who can provide reasonable explanations. However, existing expert synthetic data detection methods fail to provide human-interpretable bases for their judgments.

2.2 LARGE MULTIMODAL MODELS

Large multimodal models (LMMs) have developed rapidly in recent years, with models like GPT-4o (OpenAI, 2024) and Claude 3.5 (Anthropic, 2024) excelling in tasks such as scientific question answering (Lu et al., 2022; Yue et al., 2024) and commonsense reasoning (Talmor et al., 2018), showcasing exceptional perceptual and reasoning abilities (Bai et al., 2024b). Research has also applied LMMs to evaluate AIGC synthetic results, using GPT to assess the quality of generated images (Ku et al., 2023; Peng et al., 2024) and 3D models (Wu et al., 2024b), providing scores that align with human preferences along with interpretable justifications. Consequently, in synthetic data detection, LMMs can explain in natural language why they judge a sample to be authentic or synthetic, paving the way for enhanced interpretability. Moreover, LMMs can access features invisible to human users, such as deep image and spectral features, demonstrating their potential to exceed human detection capabilities.
Furthermore, synthetic data detection involves multimodal data perception and complex logical reasoning, making it an excellent task for assessing the capabilities of LMMs. The task also provides quantitative evaluation metrics such as accuracy, allowing a more direct assessment of model performance than more qualitative scoring tasks.

2.3 SYNTHETIC DATA DETECTION BENCHMARK

Numerous datasets exist for synthetic data detection tasks, including those designed for traditional detection methods and those tailored for LMMs. For instance, traditional synthetic datasets such as Fake2M (Lu et al., 2023b), HC3 (Guo et al., 2023), and ASVSpoof 2019 (Wang et al., 2020b) have been used to study the performance of traditional deepfake detection methods across various modalities, but they do not assess LMMs. VANE (Bharadwaj et al., 2024) evaluates the capability of LMMs in detecting video anomalies, including the detection of criminal activities in real videos and synthetic video detection, although it focuses more on video content understanding. Fakebench (Li et al., 2024b) assesses LMM pe