CVPR 2022

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, Ludwig Schmidt

364 citations

Abstract

Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of six further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference. 
[Figure: two schematic panels. Axes: accuracy on the reference distribution (e.g., ImageNet) and accuracy on the distribution shifts. Legend: models trained on the reference distribution train set; zero-shot CLIP models; fine-tuned CLIP; effective robustness.]
Left panel: fine-tuning CLIP on the reference distribution leads to higher accuracy on the reference distribution but less robustness.
Right panel: our method, WiSE-FT, leads to better accuracy on the distribution shifts without decreasing accuracy on the reference distribution. The curve is traced by varying a mixing coefficient α in the weight-space ensemble for $\alpha \in [0, 1]$: $\theta_\alpha = (1 - \alpha) \cdot \theta_{\text{zero-shot}} + \alpha \cdot \theta_{\text{fine-tuned}}$.
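The weight-space ensemble can be illustrated with a minimal sketch: each parameter of the zero-shot model is linearly interpolated with the matching parameter of the fine-tuned model using a mixing coefficient α in [0, 1]. The function name `wise_ft`, the dictionary layout, and the toy scalar values below are assumptions made for illustration and are not taken from the authors' code; in practice the values would be weight tensors, and the same expression applies to any type that supports scalar multiplication and addition.

```python
def wise_ft(theta_zero_shot, theta_fine_tuned, alpha):
    """Interpolate two parameter dictionaries: (1 - alpha) * zero-shot + alpha * fine-tuned.

    Hypothetical helper for illustration; both dictionaries must use the same parameter names.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_zero_shot.keys() != theta_fine_tuned.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: (1.0 - alpha) * theta_zero_shot[name] + alpha * theta_fine_tuned[name]
        for name in theta_zero_shot
    }


# Toy parameters (made-up values; scalars stand in for weight tensors).
zero_shot = {"w": 1.0, "b": 0.0}
fine_tuned = {"w": 3.0, "b": 2.0}

# alpha = 0 recovers the zero-shot weights, alpha = 1 the fine-tuned weights.
mixed = wise_ft(zero_shot, fine_tuned, alpha=0.5)
print(mixed)
```

Because the interpolation happens once, on the weights, the resulting model has the same size and inference cost as either endpoint, which is consistent with the abstract's statement that the method adds no computational cost during fine-tuning or inference.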