CVPR 2023
Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
Xuran Pan, Tianzhu Ye, Zhuofan Xia, Shiji Song, Gao Huang
Abstract
We summarize the architectures of the five Transformer models adopted in the main paper, namely PVT [11], PVTv2 [12], Swin Transformer [8], CSWin Transformer [3], and NAT [4], in Tab. 5-10. For a fair comparison, we substitute only the original self-attention blocks in the early stages of each baseline model with our proposed Slide Attention; the remaining blocks, the training configurations, and the model structure (width and depth) are kept unchanged.
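The substitution rule described above can be illustrated with a minimal sketch. It is not the authors' code: the names `StageConfig` and `substitute_early_stages`, the attention labels, the number of stages treated as "early", and the depth/width values are all hypothetical placeholders rather than figures from Tab. 5-10. The sketch only shows that the attention type of the first few stages changes while depth and width stay untouched.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical description of one stage of a hierarchical backbone."""
    depth: int       # number of blocks in the stage
    width: int       # channel dimension of the stage
    attention: str   # label for the attention type used in the stage


def substitute_early_stages(
    stages: List[StageConfig],
    num_early: int,
    new_attention: str = "slide",
) -> List[StageConfig]:
    """Return a copy of `stages` whose first `num_early` stages use
    `new_attention`; depth and width are copied unchanged."""
    if not 0 <= num_early <= len(stages):
        raise ValueError("num_early must be between 0 and len(stages)")
    return [
        replace(stage, attention=new_attention) if i < num_early else stage
        for i, stage in enumerate(stages)
    ]


# Placeholder four-stage baseline (illustrative numbers, not from the paper).
baseline = [
    StageConfig(depth=2, width=64, attention="original"),
    StageConfig(depth=2, width=128, attention="original"),
    StageConfig(depth=6, width=256, attention="original"),
    StageConfig(depth=2, width=512, attention="original"),
]

# Assumption: the first two stages count as "early" in this example.
modified = substitute_early_stages(baseline, num_early=2)
```

Here `modified` differs from `baseline` only in the `attention` field of the first two entries, which mirrors the fair-comparison setup: any accuracy difference can then be attributed to the attention module rather than to a change in model size or training recipe.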