CVPR 2025

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny

Abstract

[Figure 1 panels: (a) Previous benchmarks (RetVQA & WebVQA): each question paired with a limited image set. (b) Our benchmarks: all questions mapped to an extensive document collection. Example questions shown in the figure: "In which subject, Dr Alexander R. Inglis did his masters degree?"; "What percent of analytics jobs in India requires more than 5 years of experience according to the 2017 study?"; RetVQA & WebVQA: "Does grass and sky share the same color?"; "Are the satellites on the Soviet space control/monitoring ship Kosmonavt Yuriy Gagarin always oriented in the same direction?"]

Figure 1. Comparison between previous and proposed benchmarks. Given a question as input, all benchmarks aim to retrieve relevant images from an image pool to correctly answer the question. Unlike prior benchmarks like RetVQA [32] and WebVQA [7], which structure their datasets by pairing each question with a limited set of images (typically ≤ 30), our benchmarks, DocHaystack and InfoHaystack, map each question to a substantially larger document collection, scaling up to 1,000 visual documents. This expanded scope more accurately represents large-scale document retrieval scenarios and offers a greater challenge in retrieval accuracy and visual question answering.
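The difference between the two benchmark layouts in Figure 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the document ids, the short text strings standing in for visual documents, and the token-overlap scoring, which is a crude placeholder for a real retriever and not the method used in the paper. It only shows that a prior-style benchmark searches a small per-question candidate set, while a haystack-style benchmark searches one shared pool for every question.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split on whitespace (a crude stand-in for a real encoder)."""
    return set(text.lower().split())


def retrieve(question: str, pool: Dict[str, str], k: int) -> List[str]:
    """Return the ids of the k pool entries sharing the most tokens with the question."""
    q = tokenize(question)
    # Sort by overlap (descending), then by id so that ties are deterministic.
    ranked = sorted(pool, key=lambda doc_id: (-len(q & tokenize(pool[doc_id])), doc_id))
    return ranked[:k]


def recall_at_k(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that appear among the retrieved ids."""
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical text surrogates for visual documents (real benchmarks use images).
shared_pool = {
    "doc_001": "curriculum vitae masters degree",
    "doc_002": "analytics jobs india experience 2017 study",
    "doc_003": "annual report revenue 2017",
    "doc_004": "grass and sky photograph",
}

question = "analytics jobs in india"

# (a) Prior style: the question carries its own small candidate set.
candidates = ["doc_002", "doc_003"]
small_pool = {doc_id: shared_pool[doc_id] for doc_id in candidates}
top_small = retrieve(question, small_pool, k=1)

# (b) Haystack style: the question is searched against the whole shared pool.
top_large = retrieve(question, shared_pool, k=1)

print(top_small, top_large, recall_at_k(top_large, {"doc_002"}))
```

In this toy setting both layouts return the same document. With a pool of around 1,000 documents, as in the caption, the number of distractors grows, which is why the shared-pool layout is the harder retrieval problem.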