| CPC G06N 3/063 (2013.01) [G06N 3/04 (2013.01); G06N 3/08 (2013.01)] | 20 Claims |

1. A method for compressing a neural network which performs an inference task, the method comprising:
obtaining a batch of data samples from a training dataset, each data sample comprising input data and a respective ground-truth label;
inputting the input data of the data samples of the batch into a trained neural network to forward propagate the input data of the data samples of the batch through the neural network and generate neural network predictions for the input data of the data samples of the batch;
inputting the input data of the data samples of the batch into a Kronecker neural network to forward propagate the input data of the data samples of the batch through the Kronecker neural network to generate Kronecker predictions for the input data of the data samples of the batch;
computing a knowledge distillation loss based on outputs generated by a layer of the neural network and a corresponding Kronecker layer of the Kronecker neural network, wherein the Kronecker layer has a first parameter matrix and a second parameter matrix storing learnable parameters of the Kronecker layer of the Kronecker neural network, and the output of the Kronecker layer is generated based on tiled matrix multiplication between a Kronecker product of the first parameter matrix and the second parameter matrix and a plurality of tiles of an input matrix, each tile comprising a respective subset matrix of the input matrix;
computing a loss for the Kronecker neural network based on the Kronecker predictions and ground-truth labels of the data samples of the batch;
combining the knowledge distillation loss and the loss for the Kronecker neural network to generate a total loss for the Kronecker neural network; and
back propagating the total loss through the Kronecker neural network to adjust values of learnable parameters of the Kronecker neural network.
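The tiled matrix multiplication recited in the claim can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the shapes, variable names, and tiling layout are assumptions chosen so that the Kronecker product of a first parameter matrix A (m x n) and a second parameter matrix B (p x q) can be applied to an input matrix without ever materializing the full (m*p x n*q) product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes (not specified in the claim).
m, n, p, q, d = 2, 3, 4, 5, 7
A = rng.standard_normal((m, n))      # first parameter matrix
B = rng.standard_normal((p, q))      # second parameter matrix
X = rng.standard_normal((n * q, d))  # input matrix to the Kronecker layer

# Reference: materialize the full Kronecker product and multiply directly.
Y_ref = np.kron(A, B) @ X

# Tiled multiplication: split X into n row tiles, each a (q x d) subset
# matrix of the input matrix. Because block (i, j) of kron(A, B) equals
# A[i, j] * B, the i-th output tile is sum_j A[i, j] * (B @ X_j), so the
# (m*p x n*q) Kronecker product is never formed explicitly.
tiles = [X[j * q:(j + 1) * q, :] for j in range(n)]
Y = np.concatenate(
    [sum(A[i, j] * (B @ tiles[j]) for j in range(n)) for i in range(m)],
    axis=0,
)

assert np.allclose(Y, Y_ref)  # tiled result matches the explicit product
```

The memory saving is the point of the factorization: the layer stores only the m*n + p*q entries of the two parameter matrices rather than the m*p*n*q entries of their Kronecker product.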
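The loss combination in the claim can likewise be sketched. The claim only states that a knowledge distillation loss on layer outputs and a prediction loss against ground-truth labels are combined into a total loss; the choice of mean-squared error for distillation, cross-entropy for the task loss, and a weighting factor alpha are all assumptions made here for illustration.

```python
import numpy as np

def total_loss(teacher_feats, student_feats, student_logits, labels, alpha=0.5):
    # Knowledge distillation loss: match the Kronecker layer's output to the
    # corresponding layer of the trained (teacher) network. MSE is assumed.
    kd = np.mean((student_feats - teacher_feats) ** 2)
    # Task loss: cross-entropy of the Kronecker network's predictions against
    # the ground-truth labels of the batch (softmax computed stably).
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    # Combine the two losses into the total loss to be back-propagated;
    # the weighting alpha is an assumed hyperparameter, not from the claim.
    return alpha * kd + (1 - alpha) * ce

# Toy batch: constant features and uniform logits so the result is easy to check.
feats_t = np.ones((4, 8))
feats_s = np.zeros((4, 8))
logits = np.zeros((4, 3))
labels = np.array([0, 1, 2, 0])
loss = total_loss(feats_t, feats_s, logits, labels)
```

In practice the total loss would be back-propagated through the Kronecker network with an autodiff framework; this sketch only shows the forward computation of the combined objective.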