CVPR2025

FLAVC: Learned Video Compression with Feature Level Attention

Chun Zhang, Heming Sun, Jiro Katto

摘要

Learned Video Compression (LVC) aims to reduce redundancy in sequential data through deep learning approaches. Recent advances have significantly boosted LVC performance by shifting compression operations to the feature domain, often combining Motion Estimation and Motion Compensation modules (MEMC) with CNN-based context extraction. However, reliance on motions and convolutiondriven context models limits generalizability and global perception. To address these issues, we propose a Featurelevel Attention (FLA) module within a Transformer-based framework that explicitly perceives full-frame, thus bypassing confined motion signatures. FLA accomplishes global perception by converting high-level local patch embeddings into one-dimensional batch-wise vectors and replacing traditional attention weights to a global context matrix. Additionally, a dense overlapping patcher (DP) is introduced to retain local features before embedding projection. Furthermore, a Transformer-CNN mixed encoder is applied to alleviate the spatial feature bottleneck without increasing latent size. Experiments demonstrate excellent generalizability with universally efficient redundancy reduction in different scenarios. Extensive tests on four video compression datasets show that our method achieves state-of-theart Rate-Distortion performance compared to existing LVC methods and traditional codecs. A down-scaled version of our model reduced computation overhead by a great margin while maintained strong performance. The code is available at https://github.com/Z-CV-code/FLAVC .