Published as a conference paper at ICLR 2026

A Structured, Tagged, and Localized Visual Question Answering Dataset with Full Sentence Answers and Scene Graphs for Chest X-ray Images

Philip Müller, Friederike Jungmann, Georgios Kaissis, Daniel Rueckert

ABSTRACT

Visual Question Answering (VQA) enables targeted and context-dependent analysis of medical images, such as chest X-rays (CXRs). However, existing VQA datasets for CXRs are typically constrained by simplistic and brief answer formats, and they lack localization annotations (e.g., bounding boxes) and structured tags (e.g., region or radiological finding/disease tags). To address these limitations, we introduce MIMIC-Ext-CXR-QBA (abbr. CXR-QBA), a large-scale CXR VQA dataset derived from MIMIC-CXR, comprising 42 million QA-pairs with multi-granular, multi-part answers, detailed bounding boxes, and structured tags. We automatically generated our VQA dataset from scene graphs (also made available), which we constructed using LLM-based information extraction from radiology reports. After automatic quality assessment, we identified 31M pre-training and 7.5M fine-tuning grade QA-pairs, providing the largest and most sophisticated VQA dataset for CXRs to date. Tools for using our dataset and the construction pipeline are available at https://github.com/philip-mueller/mimic-ext-cxr-qba/.

INTRODUCTION

With the emergence of Large Language Models (LLMs) and Large Multimodal Models (LMMs), interactive and conversational tasks have gained popularity in medical image analysis, particularly in the context of chest X-ray (CXR) interpretation (Chen et al., 2024; Müller et al., 2025; Tu et al., 2024; Xie et al., 2025). A prominent example of such interactive tasks is Visual Question Answering (VQA), where a model is presented with an image and a corresponding textual question, and is tasked with generating an answer. Unlike conventional medical imaging approaches, which always produce the same output (such as classification labels, bounding boxes, or textual reports) for a given image, VQA enables users to interactively explore and interpret images in a context-dependent manner. Training robust VQA models for medical applications requires high-quality, large-scale training datasets.
Existing CXR VQA datasets suffer from several limitations: (i) they often contain only short and simplistic answers, (ii) they lack localization information (such as bounding boxes), and (iii) they provide little structured metadata (e.g., region and finding/disease annotations, or uncertainty estimates). Additionally, their relatively small size constrains their utility for pretraining.

To address these challenges, we propose a pipeline for automatic VQA dataset creation and apply it to construct a new large-scale CXR VQA dataset. Unlike prior datasets, each question-answer (QA) pair includes multi-granular, multi-part answers composed of full sentences in the style of radiology reports. Furthermore, our dataset provides detailed bounding boxes and additional structured tags (e.g., findings and regions), enhancing interpretability and facilitating the development of more advanced and transparent medical VQA models. Fig. 1 shows examples of our generated QA-pairs.

[Figure: overview of the dataset construction pipeline. Labels recoverable from the figure: 377,110 chest X-ray images; 227,827 free-text radiology reports; example report sentence "Clear lungs."; Extract sentences; Extract indication; Extract observations; Segment 158 regions; Localize 29 regions; Derive 257 regions; CXAS; Chest Imagenome; Reference definitions; Question templates; Execute QA Generation Strategies; Quality; Fine-tuning Grade.]
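To make the answer structure described in the introduction more concrete, the following is a minimal sketch of how one QA-pair with multi-part, full-sentence answers, bounding boxes, and structured tags might be represented in memory. It is an illustration under stated assumptions, not the dataset's actual schema: the class names (`AnswerPart`, `QAPair`), all field names, the box convention, the quality label string, and the example values are hypothetical and invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention for this sketch: (x_min, y_min, x_max, y_max).
# The real dataset may use a different format and coordinate system.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class AnswerPart:
    """One full-sentence part of a multi-part answer (hypothetical layout)."""
    text: str
    boxes: List[BoundingBox] = field(default_factory=list)
    region_tags: List[str] = field(default_factory=list)
    finding_tags: List[str] = field(default_factory=list)


@dataclass
class QAPair:
    """A question with a multi-part answer and a quality label (hypothetical)."""
    question: str
    answer_parts: List[AnswerPart]
    quality: str  # assumed label, e.g. "pre-training" or "fine-tuning"

    def full_answer(self) -> str:
        # Join the sentence-level parts into one report-style answer.
        return " ".join(part.text for part in self.answer_parts)

    def all_boxes(self) -> List[BoundingBox]:
        # Collect the boxes of every answer part, in order.
        return [box for part in self.answer_parts for box in part.boxes]


# Made-up example values, for illustration only.
example = QAPair(
    question="Are the lungs clear?",
    answer_parts=[
        AnswerPart(
            text="The lungs are clear.",
            boxes=[(102.0, 80.0, 410.0, 520.0), (480.0, 85.0, 790.0, 530.0)],
            region_tags=["right lung", "left lung"],
        )
    ],
    quality="fine-tuning",
)
```

Keeping boxes and tags on each answer part, rather than on the QA-pair as a whole, is one way to preserve the link between a sentence and the region it describes. The released tools at the repository linked in the abstract define the actual format.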