ICLR2026
Soft Equivariance Regularization for Invariant Self-Supervised Learning
Joohyung Lee, Changhun Kim, Hyunsu Kim, Kwanhyung Lee, Juho Lee
1 citation
Abstract
Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations (e.g., random crops and photometric jitter). While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final (typically spatially-collapsed) representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions (e.g., rotations, flips, and scaling) applied directly in feature space. SER learns/predicts no per-sample transformation codes/labels, requires no auxiliary transformation-prediction head, and adds only 1.008 training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines (e.g., EquiMod and AugSelf) improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance. Code is available at https://github.com/aitrics-chris/SER.