US 12,362,764 B2
Neural network model compression with quantizability regularization
Wei Jiang, San Jose, CA (US); Wei Wang, Palo Alto, CA (US); and Shan Liu, San Jose, CA (US)
Assigned to TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed by TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed on Oct. 19, 2020, as Appl. No. 17/073,602.
Claims priority of provisional application 62/954,472, filed on Dec. 28, 2019.
Prior Publication US 2021/0201157 A1, Jul. 1, 2021
Int. Cl. H03M 7/30 (2006.01); G06F 18/214 (2023.01); G06N 3/04 (2023.01); G06N 3/063 (2023.01); G06N 3/084 (2023.01); G06V 10/771 (2022.01)
CPC H03M 7/3059 (2013.01) [G06F 18/214 (2023.01); G06N 3/04 (2013.01); G06N 3/063 (2013.01); G06N 3/084 (2013.01); G06V 10/771 (2022.01); H03M 7/702 (2013.01)] 12 Claims
OG exemplary drawing
 
1. A method for compressing a neural network model for deployment on a terminal, executable by a processor, comprising:
reshaping, for a layer in a deep neural network model, a weight tensor having a first dimension into a reshaped weight tensor having a second dimension, the second dimension being less than the first dimension,
wherein a size of the reshaped weight tensor is based on a number of input channels, a number of output channels, and an axis along which the weight tensor is reshaped;
partitioning, for the layer in the deep neural network model, the reshaped weight tensor into one or more blocks;
averaging, for the layer in the deep neural network model, weights within respective blocks of the one or more blocks;
ranking, for the layer in the deep neural network model, the one or more blocks of the reshaped weight tensor based on a loss associated with the respective blocks;
fixing, for the layer in the deep neural network model, the averaged weights within respective blocks of the one or more blocks for a predetermined number of ranked blocks and setting a respective item corresponding to a respective block in a quantization mask as a fixed value based on the average weight of the respective block;
training the deep neural network model based on updating un-fixed weights associated with a remaining number of ranked blocks;
compressing the deep neural network model, for each layer in the deep neural network model, based on the averaged weights for respective layers of the deep neural network model.
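The claimed steps — reshape, block partition, per-block averaging, loss-based ranking, fixing the averaged blocks under a quantization mask, and training only the un-fixed weights — can be sketched in NumPy. All names, the 2-D reshape, the square block shape, and the squared-deviation quantization loss used for ranking are illustrative assumptions, not the patent's specific implementation:

```python
import numpy as np

def compress_layer(weights, block_size, num_fixed):
    """Sketch of the claimed compression: reshape the weight tensor into a
    lower-dimensional form, partition it into blocks, average each block,
    rank blocks by a quantization loss, and fix the lowest-loss blocks
    (recording them in a quantization mask)."""
    # Reshape into 2-D: (output channels, everything else). The second
    # dimensionality (2) is less than the original tensor's rank.
    out_ch = weights.shape[0]
    w2d = weights.reshape(out_ch, -1).copy()
    rows, cols = w2d.shape
    assert rows % block_size == 0 and cols % block_size == 0

    mask = np.zeros((rows, cols), dtype=bool)  # quantization mask
    blocks = []
    for i in range(0, rows, block_size):
        for j in range(0, cols, block_size):
            blk = w2d[i:i + block_size, j:j + block_size]
            mean = blk.mean()  # averaged weight for this block
            # Assumed loss: squared deviation from the block average.
            loss = np.sum((blk - mean) ** 2)
            blocks.append((loss, i, j, mean))

    # Rank blocks by loss and fix the num_fixed lowest-loss blocks:
    # set each to its average and mark it fixed in the mask.
    blocks.sort(key=lambda b: b[0])
    for loss, i, j, mean in blocks[:num_fixed]:
        w2d[i:i + block_size, j:j + block_size] = mean
        mask[i:i + block_size, j:j + block_size] = True

    return w2d.reshape(weights.shape), mask.reshape(weights.shape)

def masked_update(weights, grad, mask, lr=0.01):
    """One training step that updates only the un-fixed weights:
    gradients are zeroed wherever the quantization mask is set."""
    return weights - lr * grad * (~mask)
```

In a full pipeline this would run per layer, with the averaged (now piecewise-constant) blocks giving the compressibility the claim targets; the mask keeps those blocks frozen through subsequent fine-tuning.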