CVPR 2024

Mean-Shift Feature Transformer

Takumi Kobayashi

Abstract

Transformer models developed in NLP have had a great impact on computer vision, producing promising performance on various tasks. While multi-head attention, the characteristic mechanism of the transformer, attracts keen research interest, for example in reducing its computation cost, we analyze the transformer model from the viewpoint of feature transformation based on the distribution of input feature tokens. The analysis inspires us to derive a novel transformation method from the mean-shift update, an effective gradient ascent that seeks a local mode of distinctive representation on the token distribution. We also present an efficient projection approach to reduce the parameter size of the linear projections constituting the proposed multi-head feature transformation. In experiments on the ImageNet-1K dataset, the proposed methods, embedded into various network models in place of the transformer module, exhibit favorable performance improvement. Code is available at https://github.com/tk1980/MSFtransformer.
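For readers unfamiliar with the mean-shift update mentioned above, the following is a minimal sketch of the textbook form with a Gaussian kernel, in which each query moves to a kernel-weighted average of the data points. It is not the transformation proposed in this paper: the function name `mean_shift_step`, the `bandwidth` parameter, the Gaussian kernel, and the toy two-cluster data are all assumptions chosen for illustration, and no learned projections or multiple heads are modeled.

```python
import numpy as np


def mean_shift_step(queries, data, bandwidth=1.0):
    """One textbook mean-shift update (illustrative; not the paper's method).

    queries: (m, d) points to move; data: (n, d) fixed samples that define
    the density. Returns the (m, d) kernel-weighted means of `data`.
    """
    # Pairwise squared Euclidean distances between queries and data: (m, n).
    diff = queries[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Gaussian kernel weights, normalized per query (a softmax over the data).
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each query moves to the weighted average of the data points.
    return weights @ data


# Toy "tokens": two well-separated clusters in 2-D (hypothetical data).
rng = np.random.default_rng(0)
tokens = np.concatenate([
    rng.normal(-3.0, 0.3, size=(8, 2)),
    rng.normal(3.0, 0.3, size=(8, 2)),
])

# Repeating the update moves each token toward a local mode of the density.
shifted = tokens
for _ in range(10):
    shifted = mean_shift_step(shifted, tokens, bandwidth=1.0)
```

In this toy setting the tokens of each cluster contract toward a common point near that cluster's density mode, which is the mode-seeking behavior the abstract refers to. The bandwidth controls how local the averaging is: a very large value makes every query move to the global mean of the data.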