CVPR 2025

Breaking the Low-Rank Dilemma of Linear Attention

Qihang Fan, Huaibo Huang, Ran He

Abstract

The Softmax attention mechanism in Transformer models is notoriously computationally expensive due to its quadratic complexity in the number of tokens, posing significant challenges in vision applications. In contrast, linear attention offers a far more efficient solution by reducing the complexity to linear. However, linear attention often suffers significant performance degradation compared to Softmax attention. Our experiments indicate that this performance drop stems from the low-rank nature of linear attention's output feature map, which hinders its ability to adequately model complex spatial information. To address this low-rank dilemma, we conduct rank analysis from two perspectives: the KV buffer and the output features. Based on this analysis, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Building upon RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, illustrating the potential of RALA. Code will be available at https://github.com/qhfan/RALA .
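To make the complexity and rank contrast concrete, the following is a minimal NumPy sketch of generic kernelized linear attention next to Softmax attention. It is not RALA and is not taken from the authors' code. The feature map `elu(x) + 1`, the sizes `N = 64` and `d = 8`, and all function names are assumptions chosen for illustration. The sketch shows that computing the `d x d` KV buffer first gives the same output as materializing the `N x N` attention matrix, and that this implicit attention matrix has rank at most `d`, whereas the Softmax attention matrix is not bounded in this way.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive kernel feature map.
    # Assumption for illustration only; not necessarily the map used by RALA.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax_attention_matrix(q, k):
    # Standard scaled dot-product attention weights, shape (N, N).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def linear_attention_quadratic(q, k, v):
    # Reference form: materialise the N x N matrix phi(Q) phi(K)^T, O(N^2 d).
    pq, pk = feature_map(q), feature_map(k)
    a = pq @ pk.T
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v, a

def linear_attention(q, k, v):
    # Linear form: build the d x d KV buffer first, O(N d^2).
    pq, pk = feature_map(q), feature_map(k)
    kv = pk.T @ v                 # KV buffer, shape (d, d)
    z = pq @ pk.sum(axis=0)       # per-query normaliser, shape (N,)
    return (pq @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 64, 8                      # hypothetical token count and head dimension
q, k, v = rng.standard_normal((3, N, d))

out_ref, attn_linear = linear_attention_quadratic(q, k, v)
out_fast = linear_attention(q, k, v)

rank_linear = np.linalg.matrix_rank(attn_linear)
rank_softmax = np.linalg.matrix_rank(softmax_attention_matrix(q, k))

print("outputs match:", np.allclose(out_ref, out_fast))
print("rank of linear attention matrix:", rank_linear, "(bounded by d =", d, ")")
print("rank of softmax attention matrix:", rank_softmax)
```

The rank bound follows because the linear attention matrix is a product of an `N x d` and a `d x N` factor (row normalization does not raise the rank). This illustrates only the attention-matrix side of the argument; the rank analysis of the KV buffer and output features described in the abstract is the paper's own contribution and is not reproduced here.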