ICLR2026

Reliable Evaluation of MRI Motion Correction: Dataset and Insights

Kun Wang, Tobit Klug, Stefan Ruschke, Jan Kirschke, Reinhard Heckel


Abstract

Correcting motion artifacts in MRI is important, as they can hinder accurate diagnosis. However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data. To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings. To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed Paired Motion-Corrupted 3D brain MRI data. To advance evaluation quality, we introduce MoMRISim, a feature-space metric trained for evaluating motion reconstructions. We assess each evaluation approach and find real-world evaluation together with MoMRISim, while not perfect, to be most reliable. Evaluation based on simulated motion systematically exaggerates algorithm performance, and reference-free evaluation overrates oversmoothed deep learning outputs.

We consider standard pixel-space metrics such as SSIM [Wan+04] and PSNR [HZ10] and feature-space metrics such as DreamSim [Fu+23], and we additionally propose a feature-space metric, MoMRISim, that is trained to align with varying levels of motion severity. We find that, for scans with moderate to severe motion corruption, reference-based evaluation using feature-space metrics like MoMRISim correlates well with human judgments and provides a reliable measure of reconstruction quality. However, under mild motion, the motion-free reference reconstructions often retain residual artifacts, and in some cases, mildly motion-corrupted scans reconstructed with motion-correction methods appear visually cleaner than the reference. This challenges the reliability of reference-based evaluation in mild motion settings, where simulated data with known ground truth can offer a more meaningful alternative for evaluation.
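As a rough illustration of what checking a reference-based metric against human judgments can look like, the sketch below computes PSNR for a few reconstructions of a toy volume and measures how well the resulting ranking agrees with a set of ratings. This is not the MoMRISim pipeline: the toy data, the `human_ratings` values, and the choice of Spearman rank correlation are assumptions made for illustration only, and the text above does not specify which agreement statistic is used.

```python
import numpy as np


def psnr(reference, reconstruction, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference and a reconstruction."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def spearman(x, y):
    """Spearman rank correlation; this simple version assumes no tied values."""
    rank_x = np.argsort(np.argsort(x)).astype(np.float64)
    rank_y = np.argsort(np.argsort(y)).astype(np.float64)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Toy stand-in for a motion-free reference volume (assumption: intensities in [0, 1]).
rng = np.random.default_rng(0)
reference = rng.random((8, 8, 8))

# Toy "reconstructions" with increasing corruption (additive noise, not real motion).
noise_levels = [0.01, 0.05, 0.1, 0.2]
reconstructions = [
    reference + sigma * rng.standard_normal(reference.shape) for sigma in noise_levels
]

metric_scores = [psnr(reference, rec) for rec in reconstructions]

# Hypothetical human quality ratings (higher is better), one per reconstruction.
human_ratings = [4, 3, 2, 1]

rho = spearman(metric_scores, human_ratings)
print("PSNR per reconstruction (dB):", [round(s, 2) for s in metric_scores])
print("Rank correlation with ratings:", round(rho, 3))
```

The same loop applies to any reference-based metric: replace `psnr` with a feature-space distance and flip the sign if lower values mean better quality.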
Second, we assess evaluation based on simulated motion corruption, and observe that some methods achieve almost error-free reconstructions under the most severe simulated motion, whereas the same methods exhibit noticeable residual artifacts under severe real-world motion. This is consistent with findings for other imaging problems, where evaluation on simulated data was found to potentially lead to misleading conclusions [Shi+22]. Third, regarding reference-free evaluation, we propose and consider a vision-language model (VLM) score. While the VLM score aligns significantly better with perceived image quality than classical gradient-based reference-free metrics, we find it to be biased towards reconstructions that are overly smooth and may miss anatomical details. In summary, all three considered evaluation methods have shortcomings, but evaluation on real-world paired datasets such as PMoC3D, when combined with an appropriate feature-based metric such as MoMRISim, provides a relatively reliable and meaningful assessment of reconstruction performance under moderate to severe motion.

The PMoC3D Dataset for real-world evaluation

In order to evaluate accelerated 3D motion correction methods, we constructed the PMoC3D dataset, described in this section. PMoC3D is a 3D dataset containing the raw measurement data of scans

[Figure: example scans with their PMAS values. Panel labels: L1(S7 3) PMAS: -0.23; L1(S5 2) PMAS: 0.36; L1(S6 2) PMAS: 1.04; L1(S8 2) PMAS: 1.40; L1(S2 1) PMAS: 2.42]