ACL 2024

REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset

Vaidehi Patil, Leonardo F. R. Ribeiro, Mengwen Liu, Mohit Bansal, Markus Dreyer

Abstract

Multimodal Large Language Models (MLLMs) excel at synthesizing key information from diverse sources. However, generating accurate and faithful multimodal summaries is challenging, primarily due to the lack of appropriate multimodal datasets for fine-tuning that meaningfully integrate textual and visual modalities. To address this gap, we present a new dataset specifically designed for image-text multimodal summarization, harnessing the capabilities of state-of-the-art MLLMs. We generate summaries from Wikipedia sections and corresponding images and evaluate them across text-based, visual and multimodal dimensions, employing reference-free metrics. To refine the dataset, we: (1) filter the MLLM-generated summaries by training a critic model on human annotations and using its predictions to remove low-quality summaries; (2) fine-tune the MLLM with the filtered high-quality summaries; (3) use the fine-tuned model in turn to regenerate the summaries. This self-refinement process notably improves summary quality, as measured by human judgments and automatic multimodal metrics, resulting in a valuable dataset for multimodal summarization research.¹

* Work done as an intern at Amazon AGI.

¹ The dataset is publicly available at https://github.com/amazon-science/refinesumm.

The Italian wall lizard or ruin lizard (Podarcis siculus, from the Greek meaning agile and feet) is a species of lizard in the family Lacertidae. P. siculus is native to Bosnia and Herzegovina,