CVPR 2023

Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

Xuran Pan, Tianzhu Ye, Zhuofan Xia, Shiji Song, Gao Huang

Abstract

We summarize the architectures of the five Transformer models adopted in the main paper, namely PVT [11], PVTv2 [12], Swin Transformer [8], CSwin Transformer [3], and NAT [4], in Tab. 5-10. For a fair comparison, we substitute only the original self-attention blocks at the early stages of each baseline model with our proposed Slide Attention; the remaining blocks, the training configurations, and the model structure (width and depth) are kept unchanged.
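The substitution rule can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `StageConfig` class, the `substitute_early_attention` helper, the attention labels, the number of stages treated as "early", and the depth and width values are hypothetical and are not taken from the tables or from any released code.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical description of one stage of a hierarchical Transformer."""
    depth: int      # number of blocks in the stage
    width: int      # channel dimension of the stage
    attention: str  # label of the attention type used by the stage's blocks


def substitute_early_attention(
    stages: List[StageConfig],
    num_early_stages: int,
    new_attention: str = "slide",
) -> List[StageConfig]:
    """Return a copy where only the first `num_early_stages` stages change attention type.

    Depth and width are copied unchanged, mirroring the fair-comparison setup.
    """
    if not 0 <= num_early_stages <= len(stages):
        raise ValueError("num_early_stages must lie between 0 and len(stages)")
    return [
        replace(stage, attention=new_attention) if index < num_early_stages else stage
        for index, stage in enumerate(stages)
    ]


# Placeholder four-stage baseline; the numbers are illustrative only.
baseline = [
    StageConfig(depth=2, width=64, attention="original"),
    StageConfig(depth=2, width=128, attention="original"),
    StageConfig(depth=6, width=256, attention="original"),
    StageConfig(depth=2, width=512, attention="original"),
]

# Assumption: the first two stages count as "early" in this sketch.
modified = substitute_early_attention(baseline, num_early_stages=2)
```

In this sketch `modified` differs from `baseline` only in the `attention` field of the first two stages, which is the property the comparison relies on: any accuracy change can then be attributed to the attention module rather than to a different width, depth, or training recipe.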