ICLR2026
Reducing information dependency does not cause training data privacy. Adversarially non-robust features do.
Rasmus Torp, Shailen Smith, Adam Breuer
Abstract
In this paper, we challenge the prevailing view that information dependency (including rote memorization) drives training data exposure to image reconstruction attacks. We show that extensive exposure can persist without rote memorization and is instead caused by a tunable connection to adversarial robustness. We begin by presenting three surprising results: (1) recent defenses that inhibit reconstruction by Model Inversion Attacks (MIAs), which evaluate leakage under an idealized attacker, do not reduce standard measures of information dependency (HSIC); (2) models that maximally memorize their training datasets remain robust to MIA reconstruction; and (3) models trained without seeing 97% of the training pixels, where recent information-theoretic bounds give arbitrarily strong privacy guarantees under standard assumptions, can still be devastatingly reconstructed by MIA. To explain these findings, we provide causal evidence that privacy under MIA arises from what the adversarial examples literature calls "non-robust" features (generalizable but imperceptible and unstable features). We further show that recent MIA defenses obtain their privacy improvements by unintentionally shifting models toward such features. To establish this causal relationship, we introduce Anti Adversarial Training (AT-AT), a training regime that intentionally learns non-robust features to obtain both superior reconstruction defense and higher accuracy than state-of-the-art defenses. Our results revise the prevailing understanding of training data exposure and reveal a new privacy-robustness tradeoff.

* Equal contribution.
† Breuer Lab gratefully acknowledges the support of the OpenAI Cybersecurity Grant.
Replication code is available at https://github.com/BreuerLabs/Anti-Adversarial-Training

Published as a conference paper at ICLR 2026
Unlike more conservative attacks that merely probe for the presence of some exposed training examples, MIAs test the degree to which a powerful attacker armed with white-box model access and significant computational and data resources can reconstruct information from the training examples. While early MIAs applied SGD to directly optimize reconstructions of individual training examples (Fredrikson et al., 2015), contemporary MIAs use sophisticated gradient techniques and external data to attempt to infer the full set of class-level characteristics for each class in the target model's training data (Qiu et al., 2024; Struppek et al., 2022; Haim et al., 2022). However, while MIAs lower-bound the extent to which contemporary vision models expose their training data to reconstruction, they do not explain what aspect of a model encodes this vulnerability, or how to prevent it. This raises a fundamental question:

What properties of learned representations encode vulnerability to training data leakage and reconstruction, and how can they be controlled?

⋄ ⋄ ⋄

Understanding these properties has far-reaching implications for learning that extend beyond the privacy domain. For example, influential recent work conjectures that obtaining stronger learning performance may require more extensive model-to-training-data dependencies, including rote memorization and exposure of the training set, not only for vanilla CNNs (