ICLR 2022

Transformer-based Transform Coding

Yinhao Zhu, Yang Yang, Taco Cohen

218 citations

Abstract

Neural data compression based on nonlinear transform coding has made great progress over the last few years, mainly due to improvements in prior models, quantization methods and nonlinear transforms. A general trend in many recent works pushing the limit of rate-distortion performance is to use ever more expensive prior models that can lead to prohibitively slow decoding. Instead, we focus on more expressive transforms that result in a better rate-distortion-computation trade-off. Specifically, we show that nonlinear transforms built on Swin-transformers can achieve better compression efficiency than transforms built on convolutional neural networks (ConvNets), while requiring fewer parameters and shorter decoding time. Paired with a compute-efficient Channel-wise Auto-Regressive Model prior, our SwinT-ChARM model outperforms VTM-12.1 by 3.68% in BD-rate on Kodak with comparable decoding speed. In the P-frame video compression setting, we are able to outperform the popular ConvNet-based scale-space-flow model by 12.35% in BD-rate on UVG. We provide model scaling studies to verify the computational efficiency of the proposed solutions and conduct several analyses to reveal the source of coding gain of transformers over ConvNets, including better spatial decorrelation, flexible effective receptive field, and more localized response of latent pixels during progressive decoding.

* Equal contribution.

† Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

1 In the extreme case when a latent-pixel-level spatial autoregressive prior is used, decoding of a single 512×768 image requires no fewer than 1536 interleaved executions of prior model inference and entropy decoding (assuming the latent is downsampled by a factor of 16×16).

As main contributions, we 1) extend Swin-Transformer (Liu et al., 2021) to a decoder setting and build Swin-transformer-based neural image codecs that attain better rate-distortion performance with lower complexity than existing solutions, 2) verify its effectiveness in video compression by enhancing scale-space-flow, a popular neural P-frame codec, and 3) conduct extensive analyses and ablation studies to explore differences between convolutions and transformers and to investigate potential sources of coding gain.

BACKGROUND & RELATED WORK

Conv-Hyperprior. The seminal hyperprior architecture (Ballé et al., 2018; Minnen et al., 2018) is a two-level hierarchical variational autoencoder, consisting of an encoder/decoder pair g_a, g_s and a hyper-encoder/hyper-decoder pair h_a, h_s. Given an input image x, a latent y = g_a(x) and a hyper-latent z = h_a(y) are computed. The quantized hyper-latent ẑ = Q(z) is modeled and entropy-coded with a learned factorized prior. The latent y is modeled with a factorized Gaussian distribution p(y|ẑ) = N(µ, diag(σ)) whose parameters are given by the hyper-decoder, (µ, σ) = h_s(ẑ). The quantized version of the latent, ŷ = Q(y - µ) + µ, is then entropy-coded and passed through the decoder g_s to obtain the reconstructed image x̂ = g_s(ŷ). The transforms g_a, g_s, h_a, h_s are all parameterized as ConvNets (for details, see Appendix A.1).

Conv-ChARM (Minnen & Singh, 2020) extends the baseline hyperprior architecture with a channel-wise auto-regressive model (ChARM), in which the latent y is split along the channel dimension into S groups (denoted y_1, ..., y_S), and the Gaussian prior p(y_s | ẑ, ŷ_{<s}) is made autoregressive across groups: the mean and scale of y_s depend on the quantized latents in the previous groups ŷ_{<s}. In practice, S = 10 provides a good balance of performance and complexity and is adopted here.

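The mean-shifted quantization ŷ = Q(y - µ) + µ and the group-wise conditioning of ChARM can be illustrated with a minimal, scalar-valued sketch. It is not the authors' implementation: each "channel" is reduced to a single number, Q is taken to be round-half-up, and `toy_param_net` is a hypothetical stand-in for the learned network that predicts (µ, σ) from ẑ and ŷ_{<s}. The rate estimate integrates the Gaussian over a unit-width bin, as is common in hyperprior-style models.

```python
import math

def split_into_groups(channels, num_groups):
    """Split a list of channels into num_groups contiguous, equally sized groups."""
    if len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(channels) // num_groups
    return [channels[i * size:(i + 1) * size] for i in range(num_groups)]

def quantize(y, mu):
    """Mean-shifted rounding: y_hat = Q(y - mu) + mu, with Q = round-half-up."""
    return math.floor(y - mu + 0.5) + mu

def gaussian_bits(y_hat, mu, sigma):
    """Bits for y_hat under N(mu, sigma^2) integrated over a unit-width bin."""
    def cdf(v):
        return 0.5 * (1.0 + math.erf(v / (sigma * math.sqrt(2.0))))
    p = cdf(y_hat - mu + 0.5) - cdf(y_hat - mu - 0.5)
    return -math.log2(max(p, 1e-12))  # clamp to avoid log(0)

def toy_param_net(z_hat, decoded_so_far):
    """Hypothetical stand-in for the learned (mu, sigma) predictor of one group.

    It only sees the hyper-latent and the already quantized earlier groups.
    """
    context = [z_hat] + [v for group in decoded_so_far for v in group]
    mu = sum(context) / len(context)
    return mu, 1.0  # fixed unit scale, purely for illustration

def charm_encode(y, z_hat, num_groups):
    """Quantize groups in order; group s is conditioned on z_hat and y_hat of groups < s."""
    decoded, total_bits = [], 0.0
    for group in split_into_groups(y, num_groups):
        mu, sigma = toy_param_net(z_hat, decoded)
        y_hat_group = [quantize(v, mu) for v in group]
        total_bits += sum(gaussian_bits(v, mu, sigma) for v in y_hat_group)
        decoded.append(y_hat_group)
    return decoded, total_bits
```

Because every group depends only on ẑ and on earlier groups, a decoder can reproduce the same (µ, σ) sequence in S sequential steps, with all values inside a group processed in parallel.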
Spatial AR models. Most recent performance advances in neural image compression are driven by the use of spatial auto-regressive/context models. Variants include causal global prediction (Guo et al., 2021), 3D context (Ma et al., 2021), block-level context (Wu et al., 2020), and nonlocal context (Li et al., 2020; Qian et al., 2021). One common issue with these designs is that decoding cannot be parallelized along spatial dimensions, leading to impractical decoding latency, especially for high-resolution images.
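The latency gap can be made concrete by counting sequential prior-model invocations. The sketch below is a back-of-the-envelope illustration rather than a measurement: it assumes one invocation per latent position for a latent-pixel-level spatial autoregressive prior and one invocation per channel group for a channel-wise prior, and it ignores the hyperprior pass and the cost of each invocation. The function names are illustrative.

```python
def latent_grid(height, width, downsample=16):
    """Spatial size of the latent, rounding up for sizes not divisible by downsample."""
    return -(-height // downsample), -(-width // downsample)

def spatial_ar_steps(height, width, downsample=16):
    """Sequential steps when every latent position waits for the previous ones."""
    h, w = latent_grid(height, width, downsample)
    return h * w

def channel_ar_steps(num_groups=10):
    """Sequential steps when only channel groups are decoded one after another."""
    return num_groups

if __name__ == "__main__":
    for height, width in [(512, 768), (1080, 1920)]:
        print(height, width,
              spatial_ar_steps(height, width),
              channel_ar_steps(10))
```

Under these assumptions the spatial count grows with image area (1536 steps for a 512×768 image at 16× downsampling), while the channel-wise count stays fixed at S regardless of resolution.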