CVPR 2022

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, Ludwig Schmidt

Cited by 364

Abstract

Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of six further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference. 
Figure (schematic). Axes: accuracy on the reference distribution (e.g., ImageNet) and accuracy on the distribution shifts. Plotted groups: models trained on the reference distribution train set, zero-shot CLIP models, and fine-tuned CLIP, with an "effective robustness" annotation.
Panel 1: fine-tuning CLIP on the reference distribution leads to higher accuracy on the reference distribution but less robustness.
Panel 2: our method, WiSE-FT, leads to better accuracy on the distribution shifts without decreasing accuracy on the reference distribution. Weight-space ensemble for $\alpha \in [0, 1]$: $\theta_\alpha = (1 - \alpha) \cdot \theta_{\text{zero-shot}} + \alpha \cdot \theta_{\text{fine-tuned}}$.
Panel 3: varying the mixing coefficient $\alpha$.
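
The weight-space ensemble described in the abstract is a linear interpolation between the zero-shot and fine-tuned parameters, (1 - α) · θ_zero-shot + α · θ_fine-tuned for α in [0, 1]. The following is a minimal sketch of that interpolation, not the authors' implementation. It assumes each model's weights are available as a plain dictionary mapping parameter names to flat lists of floats. The `StateDict` alias, the `interpolate_weights` function, and the toy values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a real model's parameter dictionary.
StateDict = Dict[str, List[float]]


def interpolate_weights(zero_shot: StateDict, fine_tuned: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must have the same parameter names")

    mixed: StateDict = {}
    for name, zs_values in zero_shot.items():
        ft_values = fine_tuned[name]
        if len(zs_values) != len(ft_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        mixed[name] = [(1.0 - alpha) * zs + alpha * ft for zs, ft in zip(zs_values, ft_values)]
    return mixed


# Toy weights, made up purely for illustration.
theta_zero_shot = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
theta_fine_tuned = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}

theta_half = interpolate_weights(theta_zero_shot, theta_fine_tuned, alpha=0.5)
print(theta_half)
```

Setting `alpha=0.0` recovers the zero-shot weights and `alpha=1.0` recovers the fine-tuned weights. Because the result is a single set of parameters, it is evaluated like any one model, which is consistent with the abstract's claim of no additional computational cost at inference.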