ICLR 2025
Memory Efficient Transformer Adapter for Dense Predictions
Dong Zhang, Rui Yan, Pingcheng Dong, Kwang-Ting Cheng
Abstract
Motivations & Solution: The ViT adapter has emerged as a key method for extracting vision-specific inductive biases from pre-trained ViT models[1], mitigating the limitations of the conventional pre-train-then-fine-tune paradigm. Existing ViT adapters achieve notable accuracy on vision tasks, but their inference efficiency is held back by suboptimal memory access patterns[2], in particular by operations such as standard normalization and frequent tensor reshaping. Our approach targets these memory access bottlenecks by reducing the dependence on layer normalization and cutting down on frequent reshaping operations, which in turn improves inference speed on downstream tasks.
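To make the reshaping cost concrete, here is a minimal sketch of why layout changes can trigger memory traffic. It is not taken from the paper: NumPy stands in for a deep-learning framework, and the tensor sizes and variable names are hypothetical. It shows that turning a channels-first feature map into a token sequence is free as long as it stays a strided view, but that materializing it in contiguous memory, or merging its axes, forces a full copy.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch, channels, height, width = 2, 8, 4, 4

# A channels-first feature map, as a convolutional branch would produce it.
feature_map = np.zeros((batch, channels, height, width), dtype=np.float32)

# Flatten the spatial axes and move channels last to get (batch, tokens, channels).
# Both steps only change shape and strides, so no data is moved yet.
tokens_view = feature_map.reshape(batch, channels, height * width).transpose(0, 2, 1)
view_shares_memory = np.shares_memory(feature_map, tokens_view)
view_is_contiguous = tokens_view.flags["C_CONTIGUOUS"]

# Kernels that need a contiguous layout force a full copy of the tensor.
tokens = np.ascontiguousarray(tokens_view)
copy_shares_memory = np.shares_memory(feature_map, tokens)

# Merging the batch and token axes of the strided view cannot be expressed
# with strides alone, so this reshape silently copies as well.
merged = tokens_view.reshape(batch * height * width, channels)
merged_shares_memory = np.shares_memory(feature_map, merged)

print(view_shares_memory, view_is_contiguous)    # True False
print(copy_shares_memory, merged_shares_memory)  # False False
```

Each such copy reads and writes the whole tensor once, so a block that alternates between the two layouts repeatedly pays that cost on every switch. This is the kind of overhead the abstract attributes to frequent reshaping.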