ICML2023
Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction
Georgii Sergeevich Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Valerievich Dimitrov, Ivan V. Oseledets
被引用 19 次
摘要
Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operation induces additional memory costs which -as we show -can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element. We show that such approximation can be achieved by computing optimal piecewiseconstant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks. The proposed approximation divides all values into several bins and saves only its corresponding bin indices instead of storing all values. This is a lossy compresion, but the additional noise introduced by it is negligible as we will show on several benchmarks 4. Main contributions of our paper are: • We propose new approximate backward computation