ACL 2025

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao

8 citations

Abstract

We introduce SCIVER, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SCIVER consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SCIVER. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.

Data: chengyewang/SciVer

Code: QDRhhhh/SciVer

Caption: Dense captioning descriptiveness precision and recall results for LLaVA-7B fine-tuned with DOCCI captions, adapted using different methods. "Trimmed" refers to naive removal of sentences, while "Gemini" involves prompting Gemini to simplify the caption.

Caption: Dense captioning results over the test sets of DOCCI when fine-tuning on original human-annotated captions, synthetic captions, and KnowAda-adapted captions (denoted as KA) with a threshold of 20%. "Automatic (Auto)" refers to model-based NLI evaluation, while "Human" refers to evaluations based on human labeling.

To ensure that KnowAda is robust across multiple models and datasets, we fix the threshold at 20% for classifying questions as unknown and fine-tune three models: PaliGemma, TinyLLaVA, and LLaVA-1.5-7B.
We fine-tune on two variations of DOCCI: one using the original DOCCI captions, and another using synthetic captions generated by Gemini, which was prompted to be visually descriptive. We evaluate the models using both an automatic NLI model and human annotators, as detailed in Section 3. In all experiments, we evaluate on 1,000 examples sampled from the DOCCI test set.
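
The text does not say how the 1,000 evaluation examples are drawn from the test set. As a minimal sketch, assuming uniform sampling without replacement and a fixed seed for reproducibility, the selection could look like the following. The function name `sample_eval_subset`, the seed value, and the placeholder example IDs are all hypothetical and are not taken from the source.

```python
import random


def sample_eval_subset(test_ids, k=1000, seed=0):
    """Draw k items uniformly without replacement, reproducibly.

    Assumption: uniform sampling with a fixed seed; the source does not
    specify the sampling procedure.
    """
    test_ids = list(test_ids)
    if k > len(test_ids):
        raise ValueError("k exceeds the number of available test examples")
    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    return rng.sample(test_ids, k)


# Hypothetical stand-in for test-set example IDs (the size is arbitrary).
test_ids = [f"test_{i:05d}" for i in range(5000)]
eval_ids = sample_eval_subset(test_ids)
```

Using a local `random.Random` instance keeps the subset identical across runs and across the different fine-tuned models being compared, so that every model is scored on the same examples.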