ICLR 2026

Joint Distribution–Informed Shapley Values for Sparse Counterfactual Explanations

Lei You, Yijun Bian, Lele Cao


Abstract

Counterfactual explanations (CE) aim to reveal how small input changes flip a model's prediction, yet many methods modify more features than necessary, reducing clarity and actionability. We introduce COLA, a model- and generator-agnostic post-hoc framework that refines any given CE by computing a coupling via optimal transport (OT) between factual and counterfactual sets and using it to drive a Shapley-based attribution (p-SHAP) that selects a minimal set of edits while preserving the target effect. Theoretically, we show that OT minimizes an upper bound on the $W_1$ divergence between factual and counterfactual outcomes and that, under mild conditions, refined counterfactuals are guaranteed not to move farther from the factuals than the originals. Empirically, across four datasets, twelve models, and five CE generators, COLA achieves the same target effects with only 26-45% of the original feature edits. On a small-scale benchmark, COLA shows near-optimality. Experiment code: https://github.com/youlei202/XAI-COLA. Software (Zhu & You, 2026): https://pypi.org/project/xai-cola/.

Background

Explainable Artificial Intelligence (XAI) is essential for making artificial intelligence systems transparent and trustworthy (Arrieta et al., 2020; Das & Rad, 2020). Within this area, feature attribution (FA) methods, such as Shapley values (Sundararajan & Najmi, 2020; Lundberg & Lee, 2017), determine how much each input feature contributes to a machine learning (ML) model's output. This helps simplify complex models by highlighting the most influential features. For example, in a healthcare model, Shapley values can identify key factors like age and medical history, assisting clinicians in understanding the model's decisions (Ter-Minassian et al., 2023; Nohara et al., 2022). Another technique, counterfactual explanations (CE) (Wachter et al., 2017; Guidotti, 2022), shows how small changes in input features can lead to different outcomes.
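
To make the link between feature attribution and sparse counterfactuals concrete, the sketch below computes exact Shapley values for a single factual/counterfactual pair and then applies only the highest-ranked edits until the prediction flips. It is a toy illustration, not COLA or p-SHAP: there is no optimal-transport coupling here, and the scorer `model`, its weights and threshold, the two input vectors, and the helper names `shapley_values` and `sparsify` are all hypothetical.

```python
from itertools import permutations


def model(x):
    # Hypothetical toy classifier: predicts 1 when a weighted score
    # reaches a threshold, otherwise 0.
    score = 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[2]
    return 1 if score >= 4.5 else 0


def shapley_values(f, factual, counterfactual):
    """Exact Shapley values for moving each feature from its factual
    value to its counterfactual value, averaged over all orderings."""
    n = len(factual)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        x = list(factual)
        prev = f(x)
        for i in order:
            x[i] = counterfactual[i]   # apply edit i on top of earlier edits
            cur = f(x)
            phi[i] += cur - prev       # marginal contribution of edit i
            prev = cur
    return [p / len(orders) for p in phi]


def sparsify(f, factual, counterfactual, phi):
    """Apply edits in decreasing order of attribution and stop as soon
    as the prediction matches that of the full counterfactual."""
    target = f(counterfactual)
    x = list(factual)
    edited = []
    for i in sorted(range(len(phi)), key=lambda j: -phi[j]):
        if f(x) == target:
            break
        x[i] = counterfactual[i]
        edited.append(i)
    return x, edited


factual = [1.0, 1.0, 1.0]          # score 3.5, predicted 0
counterfactual = [1.5, 1.5, 2.0]   # score 5.5, predicted 1, edits 3 features

phi = shapley_values(model, factual, counterfactual)
sparse_cf, edited = sparsify(model, factual, counterfactual, phi)
# phi is approximately [0.667, 0.167, 0.167]; editing feature 0 alone
# already flips the prediction, so sparse_cf == [1.5, 1.0, 1.0].
```

In this toy case the attributions sum to the change in model output (1), and a single edit reproduces the effect of the original three-feature counterfactual. Exact enumeration over all orderings is only feasible for a handful of features; practical Shapley estimators rely on sampling or other approximations.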
Although hundreds of CE algorithms have been proposed to date (Guidotti, 2022; Verma et al., 2020), it is hardly practical to find a single CE algorithm that suits all use cases, because each is tailored to its own scenarios, goals, and tasks. For instance, one CE algorithm may aim to find a single counterfactual instance for each factual instance, while another may treat multiple instances as a group and seek one or more counterfactual instances for each factual observation. In some cases, a CE algorithm focuses on the entire dataset, aiming to identify global CE that indicate the direction in which to move the factual instances to achieve the desired model output (Rawal & Lakkaraju, 2020; Ley et al., 2022; 2023; Carrizosa et al., 2024). In other scenarios, the group of factual instances is viewed as a distribution, and the aim is to find a counterfactual distribution that remains similar in shape to the factual distribution (You et al., *