US 12,010,310 B2
Method and apparatus for quantization, adaptive block partitioning and codebook coding for neural network model compression
Wei Wang, Palo Alto, CA (US); Wei Jiang, San Jose, CA (US); and Shan Liu, San Jose, CA (US)
Assigned to TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed by TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed on Dec. 22, 2021, as Appl. No. 17/559,676.
Application 17/559,676 is a continuation of application No. 17/099,202, filed on Nov. 16, 2020, granted, now Pat. No. 11,245,903.
Claims priority of provisional application 62/939,949, filed on Nov. 25, 2019.
Claims priority of provisional application 62/939,054, filed on Nov. 22, 2019.
Claims priority of provisional application 62/939,057, filed on Nov. 22, 2019.
Prior Publication US 2022/0116610 A1, Apr. 14, 2022
Int. Cl. H04N 19/13 (2014.01); G06N 3/08 (2023.01); H04N 19/119 (2014.01); H04N 19/124 (2014.01); H04N 19/147 (2014.01); H04N 19/176 (2014.01); H04N 19/192 (2014.01); H04N 19/30 (2014.01); H04N 19/46 (2014.01); H04N 19/597 (2014.01); H04N 19/96 (2014.01)
CPC H04N 19/124 (2014.11) [G06N 3/08 (2013.01); H04N 19/119 (2014.11); H04N 19/13 (2014.11); H04N 19/147 (2014.11); H04N 19/176 (2014.11); H04N 19/192 (2014.11); H04N 19/30 (2014.11); H04N 19/46 (2014.11); H04N 19/597 (2014.11); H04N 19/96 (2014.11)] 20 Claims
OG exemplary drawing
 
1. A method of quantization, adaptive block partitioning and codebook coding for neural network model compression, the method being performed by at least one processor, and the method comprising:
determining a saturated maximum value of a multi-dimensional tensor in a layer of a neural network, and a bit depth corresponding to the saturated maximum value;
clipping weight coefficients in the multi-dimensional tensor to be within a range of the saturated maximum value;
quantizing the clipped weight coefficients, based on the bit depth;
transmitting, to a decoder, a layer header comprising the bit depth and a plurality of partitioning parameters that specify at least a height threshold and a width threshold;
reshaping a four-dimensional (4D) parameter tensor of a neural network, among the quantized weight coefficients, into a three-dimensional (3D) parameter tensor of the neural network, the 3D parameter tensor comprising a convolution kernel size, an input feature size and an output feature size; and
partitioning the 3D parameter tensor based on the height threshold and the width threshold.
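The steps of claim 1 can be illustrated with a minimal NumPy sketch. This is not the patented implementation; the uniform-quantization formula, the use of the maximum absolute weight as the saturated maximum value, the 4D axis order (kernel height, kernel width, input channels, output channels), and the partitioning loop are all assumptions made for illustration:

```python
import numpy as np

def quantize_and_partition(weights_4d, bit_depth=8,
                           height_threshold=64, width_threshold=64):
    """Sketch of claim 1: clip, quantize, reshape 4D->3D, partition.

    Assumed layout of weights_4d: (kh, kw, cin, cout).
    """
    # Saturated maximum value of the tensor (assumed: max absolute weight).
    sat_max = np.abs(weights_4d).max()

    # Clip weight coefficients to the range of the saturated maximum value.
    clipped = np.clip(weights_4d, -sat_max, sat_max)

    # Quantize the clipped coefficients based on the bit depth
    # (assumed symmetric uniform quantizer).
    step = sat_max / (2 ** (bit_depth - 1) - 1)
    quantized = np.round(clipped / step).astype(np.int32)

    # Reshape the 4D parameter tensor into a 3D parameter tensor of
    # (kernel size, input feature size, output feature size).
    kh, kw, cin, cout = quantized.shape
    tensor_3d = quantized.reshape(kh * kw, cin, cout)

    # Partition the 3D tensor based on the height and width thresholds.
    blocks = []
    for i in range(0, cin, height_threshold):
        for j in range(0, cout, width_threshold):
            blocks.append(tensor_3d[:, i:i + height_threshold,
                                       j:j + width_threshold])
    return quantized, tensor_3d, blocks
```

For example, a 3x3 convolution layer with 8 input and 16 output channels reshapes to a 3D tensor of shape (9, 8, 16); with thresholds of 4 and 8, partitioning yields four blocks. In the claimed method, the bit depth and the partitioning thresholds would be signaled to the decoder in the layer header rather than hard-coded.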