CVPR 2023

Boost Vision Transformer with GPU-Friendly Sparsity and Quantization

Chong Yu, Tao Chen, Zhongxue Gan, Jiayuan Fan

Abstract

The transformer has extended its success from the language domain to the vision domain. Because of the stacked self-attention and cross-attention blocks, accelerating the deployment of vision transformers on GPU hardware is challenging and rarely studied. This paper designs a compression scheme that makes maximal use of GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, an original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type. The floating-point sparse model is then quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the extra speedup the GPU provides for 2:4 sparse calculation with integer tensors. A mixed-strategy knowledge distillation is used during the pruning and quantization process. The proposed compression scheme is flexible enough to support both supervised and unsupervised learning styles. Experimental results show that the GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer models by 6.4-12.7× in model size and 30.3-62× in FLOPs with negligible accuracy degradation on the ImageNet classification, COCO detection, and ADE20K segmentation benchmarks. Moreover, GPUSQ-ViT improves actual deployment performance, with latency and throughput gains of 1.39-1.79× and 3.22-3.43× on the A100 GPU and 1.57-1.69× and 2.11-2.51× on AGX Orin.

[Figure: dense versus sparse M×N×K GEMM on Tensor Cores. Dense operation: A matrix (dense, M×K) multiplied by B matrix (dense, K×N), accumulated into C matrix (dense, M×N). Sparse operation: A matrix (sparse) is stored as M×K/2 non-zero data values plus 2-bit indices; the indices select the matching K/2 elements out of K from the dense B matrix before accumulation into the dense C matrix.]
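To make the 2:4 pattern concrete, the following is a minimal pure-Python sketch, not the GPUSQ-ViT implementation. It assumes a magnitude-based pruning criterion (the abstract does not state which criterion is used), and the function names `prune_2_4`, `sparse_dot`, and `quantize_int8` are hypothetical. The quantization step is plain symmetric per-tensor INT8 rounding, used here only to illustrate the move from floating-point to fixed-point values; it is not the sparse-distillation-aware quantization-aware training described above.

```python
def prune_2_4(row):
    """Return (pruned_row, values, indices) for one weight row.

    In every group of four consecutive weights, the two largest-magnitude
    entries are kept (an assumed criterion) and the other two are zeroed.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned, values, indices = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank positions by magnitude, keep the top two, restore position order.
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(ranked[:2])
        pruned.extend(group[i] if i in keep else 0.0 for i in range(4))
        values.extend(group[i] for i in keep)
        indices.extend(keep)  # each index is in 0..3, so it fits in 2 bits
    return pruned, values, indices


def sparse_dot(values, indices, dense_vec):
    """Dot product using only the K/2 stored values and their 2-bit indices."""
    total = 0.0
    for n, (v, idx) in enumerate(zip(values, indices)):
        group_start = (n // 2) * 4  # two stored values per group of four
        total += v * dense_vec[group_start + idx]
    return total


def quantize_int8(values):
    """Symmetric per-tensor INT8 rounding (illustrative only, not QAT)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


# Hypothetical example data: one weight row (K = 8) and one activation column.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

pruned, values, indices = prune_2_4(weights)
dense_result = sum(w * x for w, x in zip(pruned, activations))
sparse_result = sparse_dot(values, indices, activations)
q_values, scale = quantize_int8(values)

print("pruned row:    ", pruned)
print("stored values: ", values, "indices:", indices)
print("dense vs sparse dot:", dense_result, sparse_result)
print("INT8 values:   ", q_values, "scale:", scale)
```

The compressed form stores K/2 values and one 2-bit index per value, and `sparse_dot` uses those indices to pick the matching elements of the dense operand. This mirrors, in scalar form, the selection step shown in the sparse GEMM figure.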