CPC H04N 19/124 (2014.11) [G06N 3/08 (2013.01); H04N 19/119 (2014.11); H04N 19/13 (2014.11); H04N 19/147 (2014.11); H04N 19/176 (2014.11); H04N 19/192 (2014.11); H04N 19/30 (2014.11); H04N 19/46 (2014.11); H04N 19/597 (2014.11); H04N 19/96 (2014.11)] | 20 Claims |
1. A method of quantization, adaptive block partitioning and codebook coding for neural network model compression, the method being performed by at least one processor, and the method comprising:
determining a saturated maximum value of a multi-dimensional tensor in a layer of a neural network, and a bit depth corresponding to the saturated maximum value;
clipping weight coefficients in the multi-dimensional tensor to be within a range of the saturated maximum value;
quantizing the clipped weight coefficients, based on the bit depth;
transmitting, to a decoder, a layer header comprising the bit depth and a plurality of partitioning parameters that specify at least a height threshold and a width threshold;
reshaping a four-dimensional (4D) parameter tensor of a neural network, among the quantized weight coefficients, into a three-dimensional (3D) parameter tensor of the neural network, the 3D parameter tensor comprising a convolution kernel size, an input feature size and an output feature size; and
partitioning the 3D parameter tensor based on the height threshold and the width threshold.
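The claimed steps can be illustrated with a minimal NumPy sketch. This is an assumed, hypothetical reading of the claim, not the patented implementation: the saturated maximum is taken here as the tensor's maximum absolute value, the quantizer is uniform, and the partitioning is assumed to tile the input-feature and output-feature dimensions by the height and width thresholds; the claim itself leaves these choices open.

```python
import numpy as np

def quantize_and_partition(w4d, bit_depth, h_thresh, w_thresh):
    """Hypothetical sketch of the claimed clip/quantize/reshape/partition flow."""
    # Saturated maximum value (assumption: max absolute value of the tensor).
    sat_max = np.abs(w4d).max()

    # Clip weight coefficients to within [-sat_max, sat_max].
    clipped = np.clip(w4d, -sat_max, sat_max)

    # Uniform quantization to the given bit depth (one possible scheme).
    levels = 2 ** (bit_depth - 1) - 1
    step = sat_max / levels
    q = np.round(clipped / step).astype(np.int32)

    # Reshape the 4D tensor (kh, kw, cin, cout) into a 3D tensor of
    # (convolution kernel size, input feature size, output feature size).
    kh, kw, cin, cout = w4d.shape
    t3d = q.reshape(kh * kw, cin, cout)

    # Partition into blocks no larger than the height/width thresholds
    # (assumption: thresholds apply to the input/output feature axes).
    blocks = []
    for i in range(0, cin, h_thresh):
        for j in range(0, cout, w_thresh):
            blocks.append(t3d[:, i:i + h_thresh, j:j + w_thresh])
    return q, blocks
```

Under these assumptions, a 3x3 convolution with 4 input and 6 output channels, a height threshold of 2, and a width threshold of 3 would yield four partitions of shape (9, 2, 3).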