ICLR 2025

Memory Efficient Transformer Adapter for Dense Predictions

Dong Zhang, Rui Yan, Pingcheng Dong, Kwang-Ting Cheng

Abstract

Motivations & Solution: ViT adapters have emerged as a key method for extracting vision-specific inductive biases from pre-trained ViT models [1], mitigating the limitations of the conventional paradigm of pre-training followed by fine-tuning. Although existing ViT adapters achieve notable accuracy on vision tasks, their inference efficiency is substantially compromised by suboptimal memory access patterns [2], particularly those caused by operations such as standard normalization and frequent tensor reshaping. Our approach mitigates these memory access bottlenecks by curtailing the dependence on layer normalization and substantially reducing reshaping operations, thereby improving inference speed on downstream tasks.