EMNLP 2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun

Abstract

[Figure 1 panels: Original Caption (Generated by Qwen2-VL); GPT-4o Recaptioning; RICO (Ours); Human Recaptioning. Legend: Wrong or Ambiguous Information; Other Added Details; Corrected Version; Details only Detected by RICO (Ours); Mismatch Area; Correct Area. Other labels: Reconstruct to Image; Reconstructed Image. Caption texts shown in the figure:]

The image depicts a scene with three buses parked side by side on a gravel surface. The central bus is white with a black front and the lowercase word "eVolution" written on it in the center below the windshield. This bus has a British-style license plate that reads "AE56 UTH." It features sliding passenger doors with black rubber trimming and reflective windows, showing faint reflections of trees and sky. The bus to the left, partially visible, is also white with a license plate "AE57 FBC," and has a similar design with large reflective windows, but no visible "eVolution" text. The bus to the right is a double-decker, with only the top level visible, is mostly blue with a green and white pattern along the side. A few tall trees are visible in the background, and the sky is partly cloudy, with light blue tones and patches of orange and pink due to the setting sun.

The image depicts a scene with two buses parked side by side on the ground. The foreground bus is white with a black front and the word "eVolution" written on it. It has a license plate that reads "AE56 UTH." The second bus, partially visible, is also white with a license plate "AE57 FBC." The background features a partly cloudy sky with hues of blue and white clouds, and some greenery is visible in the distance. The overall scene suggests a bus depot or a parking area for buses.

The image depicts a scene with two white single-decker buses parked side by side on a gravel surface. The bus in the foreground has "eVolution" written on the front and a license plate reading "AE56 UTH." It features a black front entry door and large front-facing windows with twin windscreen wipers. Next to it is a similar bus with the license plate "AE57 FBC." In the background, a green and blue double-decker bus is partially visible behind the two. The sky above is partly cloudy with patches of blue, and trees can be seen in the distance, suggesting the setting is a bus depot or vehicle storage area.

The image shows three buses parked side by side on the ground. The bus in the foreground is white with a black front and has the word "eVolution" written on it. Its license plate reads "AE56 UTH." To its left is a second bus, which is partially visible. It is also white, shares the same design as the first bus, and bears the license plate "AE57 FBC." Behind these two buses is a third one, which is blue. In the background, there is a partly cloudy sky with white clouds and a bright area on the right side. Some greenery is also visible in the distance. The overall scene suggests a bus depot or a parking area for buses.

Figure 1: Analysis of image captions generated by Qwen2-VL and its recaptioned variants. Despite the advanced capabilities of Qwen2-VL, the generated captions still contain incorrect or ambiguous information, such as misidentifying the number of buses, a mistake that remains uncorrected even by GPT-4o. Furthermore, both GPT-4o and human-generated recaptions often overlook fine-grained details, such as attributes and spatial relationships, which are accurately captured by our model. Reconstructing images from the captions makes it evident that our model better preserves such details, yielding reconstructions that more closely resemble the original image.