| CPC H03M 7/3059 (2013.01) [G06F 18/214 (2023.01); G06N 3/04 (2013.01); G06N 3/063 (2013.01); G06N 3/084 (2013.01); G06V 10/771 (2022.01); H03M 7/702 (2013.01)] | 12 Claims |

|
1. A method for compressing a neural network model for deployment on a terminal, executable by a processor, comprising:
reshaping, for a layer in a deep neural network model, a weight tensor having a first dimension into a reshaped weight tensor having a second dimension, the second dimension being less than the first dimension,
wherein a size of the reshaped weight tensor is based on a number of input channels, a number of output channels, and an axis along which the weight tensor is reshaped;
partitioning, for the layer in the deep neural network model, the reshaped weight tensor into one or more blocks;
averaging, for the layer in the deep neural network model, weights within respective blocks of the one or more blocks;
ranking, for the layer in the deep neural network model, the one or more blocks of the reshaped weight tensor based on a loss associated with the respective blocks;
fixing, for the layer in the deep neural network model, the averaged weights within respective blocks of the one or more blocks for a predetermined number of ranked blocks and setting a respective item corresponding to a respective block in a quantization mask as a fixed value based on the average weight of the respective block;
training the deep neural network model based on updating un-fixed weights associated with a remaining number of ranked blocks;
compressing the deep neural network model, for each layer in the deep neural network model, based on the averaged weights for respective layers of the deep neural network model.
|