| CPC G06N 3/04 (2013.01) [G06N 3/08 (2013.01)] | 26 Claims |

|
1. A computer-implemented method for quantizing tensors of a neural network model comprising multiple processing layers, comprising:
computing first clipping scalars for quantizing first tensors of a first processing layer that is coupled between two processing layers of the multiple processing layers;
processing an input by the neural network model, according to quantized tensors that include the quantized first tensors, by each processing layer of the multiple processing layers in sequence to produce intermediate tensors and an output of the neural network model;
adjusting the first tensors based on a loss gradient; and
updating the first clipping scalars based on a mean squared error to reduce differences between the adjusted first tensors and quantized adjusted first tensors.
|
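The steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and everything in it beyond the claim language is an assumption made for illustration: the symmetric uniform quantizer, the 4-bit width, the straight-through gradient, the learning rate, the grid search used to lower the mean squared error, and the hypothetical helper names `quantize`, `quant_mse`, and `update_clip`. For brevity the sketch models a single linear layer rather than a first layer coupled between two others.

```python
import numpy as np

# Assumed search grid: candidate clipping scalars as multiples of the current one.
CLIP_FACTORS = np.linspace(0.5, 1.5, 21)

def quantize(x, clip, bits=8):
    """Symmetric uniform quantizer (an assumption): clip to [-clip, clip], round to a grid."""
    levels = 2 ** (bits - 1) - 1
    scale = clip / levels
    return np.clip(np.round(x / scale), -levels, levels) * scale

def quant_mse(x, clip, bits=8):
    """Mean squared error between a tensor and its quantized version."""
    return float(np.mean((x - quantize(x, clip, bits)) ** 2))

def update_clip(x, clip, bits=8):
    """Return the candidate clipping scalar with the lowest quantization MSE.

    The current value is always a candidate, so the MSE never increases.
    """
    candidates = np.append(clip * CLIP_FACTORS, clip)
    errors = [quant_mse(x, c, bits) for c in candidates]
    return float(candidates[int(np.argmin(errors))])

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))        # "first tensors": weights of one layer
x = rng.normal(size=(8, 4))        # toy input batch
target = rng.normal(size=(8, 4))   # toy regression target
clip = float(np.max(np.abs(w)))    # initial clipping scalar

for step in range(5):
    w_q = quantize(w, clip, bits=4)        # quantize the first tensors
    y = x @ w_q                            # process the input with quantized tensors
    grad = x.T @ (y - target) / len(x)     # loss gradient, passed straight through the quantizer
    w = w - 0.1 * grad                     # adjust the full-precision tensors
    clip = update_clip(w, clip, bits=4)    # update the clipping scalar to reduce the MSE
```

Including the current scalar among the candidates is a deliberate choice in this sketch: it guarantees that each update leaves the error between the adjusted tensors and their quantized versions no larger than before. A gradient-based update of the clipping scalar would be another way to reduce the same error.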