Published as a conference paper at ICLR 2026

Bayesian Post-Training Enhancement of Regression Models with Calibrated Rankings

Kevin Tirta Wijaya, Bing Xu Hu, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei

Abstract

Accurate regression models are essential for scientific discovery, yet high-quality numeric labels are scarce and expensive. In contrast, rankings (especially pairwise rankings) are easier to obtain from domain experts or artificial intelligence judges. We introduce RANKREFINE++, a novel plug-and-play method that improves a base regressor's prediction for a query by leveraging pairwise rankings between the query and reference items with known labels. RANKREFINE++ performs a Bayesian update that combines a Gaussian likelihood from the regressor with a Bradley-Terry likelihood from the ranker. This yields a strictly log-concave posterior with a unique maximizer and fast Newton updates. We show that the prior state of the art is a special case of our framework, and we identify a fundamental failure mode: Bradley-Terry likelihoods suffer from scale mismatch and curvature dominance when the number of reference items is large, which can degrade performance. From this analysis, we derive a calibration method to adjust the information originating from the expert rankings. Using a realistically accurate ranker, RANKREFINE++ achieves a 97.65% relative improvement in median mean absolute error (MAE) reduction over the previous state-of-the-art method across 12 datasets, and it runs efficiently on a consumer-grade CPU.

An additional key insight is that the default Bradley-Terry model can be mismatched to the behavior of pairwise rankers, leading to biased estimates and increased errors. RANKREFINE++ addresses this with (i) a learned temperature that calibrates the sigmoid slope of the Bradley-Terry model, and (ii) an accuracy-aware soft gate that regulates the influence of the ranker.

Specifically, our contributions are:

1. We introduce RANKREFINE++, which combines a regressor likelihood with a calibrated ranker likelihood to enhance predictions without retraining. We prove that the state of the art, RankRefine (Wijaya et al., 2025), is a special case under the Gaussian assumption.
2. We show how an uncalibrated Bradley-Terry model can bias the estimates when the ranker likelihood dominates, and we provide a temperature calibration with an accuracy-aware soft gating mechanism to mitigate the issue.
3. Across 12 cross-domain datasets (including real-world molecular datasets), RANKREFINE++ significantly improves over existing post-training enhancement methods (Wijaya et al., 2025; Yan et al., 2024; and three other baselines) and remains effective when using imperfect LLM rankers. Specifically, with 30 reference samples and a ranker with 65% accuracy, RANKREFINE++ achieves a 19.33% median MAE reduction, compared to RankRefine's 9.78%. This corresponds to a 97.65% relative improvement.

Together, these results position RANKREFINE++ as a practical and principled way to leverage readily available pairwise information to enhance scalar regressors in data-scarce domains. Source code is available at https://github.com/ktirta/regref.
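To make the Bayesian update described above concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes a scalar query label `y`, a regressor that outputs a mean `mu` and standard deviation `sigma`, and a ranker that reports for each reference item whether the query ranks above it. The function name `refine_prediction` and the parameters `temperature` and `ranker_weight` are hypothetical. In particular, modeling the soft gate as a single scalar multiplier on the ranker term is an assumption made here for illustration only.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def refine_prediction(mu, sigma, ref_labels, query_wins,
                      temperature=1.0, ranker_weight=1.0,
                      max_iter=50, tol=1e-10):
    """Maximize a Gaussian x Bradley-Terry objective with 1-D Newton steps.

    Negative log-posterior being minimized (illustrative assumption):
        (y - mu)^2 / (2 sigma^2)
        - ranker_weight * sum_i [ c_i log s(z_i) + (1 - c_i) log(1 - s(z_i)) ]
    where z_i = (y - y_i) / temperature, s is the sigmoid, and c_i = 1 if
    the ranker says the query ranks above reference i (else 0).
    """
    y = mu  # start from the regressor's prediction
    for _ in range(max_iter):
        # Gaussian term: gradient and (constant, positive) curvature.
        grad = (y - mu) / sigma ** 2
        hess = 1.0 / sigma ** 2
        # Bradley-Terry terms: each reference adds non-negative curvature,
        # so the objective stays strictly convex in y.
        for y_ref, win in zip(ref_labels, query_wins):
            p = sigmoid((y - y_ref) / temperature)
            grad -= ranker_weight * (win - p) / temperature
            hess += ranker_weight * p * (1.0 - p) / temperature ** 2
        step = grad / hess  # undamped Newton step; a line search could be added
        y -= step
        if abs(step) < tol:
            break
    return y


if __name__ == "__main__":
    refs = [0.0, 1.0, 3.0, 4.0]
    # Ranker says the query is above the first two references only.
    print(refine_prediction(2.0, 1.0, refs, [1, 1, 0, 0]))
    # Ranker says the query is above every reference.
    print(refine_prediction(2.0, 1.0, refs, [1, 1, 1, 1]))
```

In this sketch, the ranker's contribution to the curvature is a sum over reference items, so it grows with the number of references while the Gaussian curvature `1 / sigma ** 2` stays fixed. This is one way to see how the ranker term can come to dominate the update, and why a temperature and a gate on the ranker term are natural knobs for rebalancing the two sources of information.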