CVPR 2025

The Power of Context: How Multimodality Improves Image Super-Resolution

Kangfu Mei, Hossein Talebi, Mojtaba Ardakani, Vishal M. Patel, Peyman Milanfar, Mauricio Delbracio

Abstract

[Figure 1 panels. Inputs: LR, LR (Zoomed), Caption ("A close-up of a male lion with a dark mane, light tan face, and pink tongue sticking out . . ."), Depth, Segmentation, Edge. Outputs: PASD, SeeSR, MMSR (Ours), each with a zoomed crop. Reference: HR, HR (Zoomed).]

Figure 1. Our Multimodal Super-Resolution (MMSR) method leverages the rich context of multimodal guidance, including image captions, depth maps, semantic segmentation maps, and edges inferred from the LR input. MMSR surpasses state-of-the-art methods by producing more realistic results and suppressing artifacts that, while plausible, are inconsistent with the information present in the LR input.