SIGMOD2025
Understanding the Black Box: A Deep Empirical Dive into Shapley Value Approximations for Tabular Data
Suchit Gupte, John Paparrizos
18 citations
Abstract
Understanding the decisions made by machine learning models is significant for building trust and enabling the adoption of these models in real-world applications. Shapley values have emerged as a leading method for model interpretability, offering precise insights by quantifying each feature's contribution to predictions. However, computing Shapley values requires exploring all possible combinations of features, which can be computationally expensive, especially for high-dimensional data. This challenge has led to the development of various approximation techniques, often composed of estimation and replacement strategies, to compute the Shapley values efficiently. Our study focuses on the interpretability of machine learning models for tabular datasets, one of the most common and widely used data type. However, the abundance of options has created a substantial gap in determining the most appropriate technique for practical applications. Through this study, we seek to bridge this gap by comprehensively evaluating Shapley value approximations, covering 8 replacement and 17 estimation strategies across diverse regression and classification tasks. The evaluation is conducted exclusively on tabular data, leveraging 200 synthetic and real-world datasets, covering a wide range of model types, from conventional tree-based and linear models to modern neural networks. We focus on computational efficiency and the consistency of Shapley value estimates in handling high-dimensional feature spaces. Our findings reveal that traditional sampling-based approaches significantly reduce computational costs but fail to capture complex feature interactions. On the contrary, model-specific approaches that exploit the structure of the underlying model consistently outperform model-agnostic techniques, delivering higher accuracy and faster computations. Through the study, we aim to encourage further research on Shapley value approximations, advancing data-centric explainable AI.