CVPR 2023

Masked Images Are Counterfactual Samples for Robust Fine-Tuning

Yao Xiao, Ziyi Tang, Pengxu Wei, Cong Liu, Liang Lin

Abstract

For fine-tuning on ImageNet, with either vanilla fine-tuning or our approach, we use the AdamW optimizer [8] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a weight decay of 0.1, and gradient clipping at an $\ell_2$-norm of 1. We use a batch size of 512 and fine-tune for 10 epochs. The learning rate is set to $3 \times 10^{-5}$ for all parameters and follows a cosine-annealing schedule [7] with 500 warm-up steps. For both training and testing, we resize and center-crop the images to $224 \times 224$, and no data augmentation is applied. Unlike WiSE-FT [16], we do not use label smoothing.
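As an illustration of the schedule and clipping described above, the following is a minimal, framework-free sketch rather than the authors' implementation. It takes the base learning rate, warm-up length, batch size, epoch count, and clipping threshold from the text. Everything else is an assumption made for illustration: the ImageNet-1k training-set size of 1,281,167 images, keeping the last partial batch of each epoch, a linear warm-up starting from zero, cosine decay down to zero, and the small epsilon in the clipping scale. The names `total_steps`, `lr_at`, and `clip_by_global_norm` are hypothetical helpers.

```python
import math

# Values stated in the text.
BASE_LR = 3e-5
WARMUP_STEPS = 500
BATCH_SIZE = 512
EPOCHS = 10
MAX_GRAD_NORM = 1.0

# Assumption: ImageNet-1k training-set size (not stated in the text).
NUM_TRAIN_IMAGES = 1_281_167


def total_steps(num_images=NUM_TRAIN_IMAGES, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Number of optimizer steps, assuming the last partial batch is kept."""
    return math.ceil(num_images / batch_size) * epochs


def lr_at(step, total, base_lr=BASE_LR, warmup=WARMUP_STEPS):
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warm-up from 0 to base_lr, then cosine decay to 0.
    """
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=MAX_GRAD_NORM, eps=1e-6):
    """Rescale a list of gradient vectors so their joint l2-norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (norm + eps))
    return [[g * scale for g in vec] for vec in grads], norm


if __name__ == "__main__":
    n = total_steps()
    for s in (0, 250, 500, n // 2, n):
        print(f"step {s:>6d}: lr = {lr_at(s, n):.3e}")
```

Under these assumptions the learning rate peaks at $3 \times 10^{-5}$ exactly when warm-up ends (step 500) and decays towards zero at the final step. In a deep-learning framework, the same behaviour would typically be obtained from the built-in optimizer, scheduler, and gradient-clipping utilities instead of hand-written helpers.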