CPC G06N 3/08 (2013.01) [G06N 3/04 (2013.01)]    20 Claims

1. A method of quantizing a neural network, comprising:
loading a first bit-width neural network model of the neural network on a first processor of a mobile device implementing a first bit-width architecture;
performing cross-layer range equalization across at least a first layer and second adjacent layers in the first bit-width neural network model by:
scaling each output channel weight of the first layer of the first bit-width neural network model by a corresponding scaling factor in the first bit-width neural network model and generating a first layer of a first bit-width scaled neural network model; and
scaling each of the second adjacent layers' corresponding input channel weights by applying an inverse of the corresponding scaling factor to the input channel weights in the first bit-width neural network model and generating each of a plurality of second adjacent layers of the first bit-width scaled neural network model;
quantizing the output channel weights and corresponding input channel weights from the first bit-width scaled neural network model to a second bit-width architecture and generating a second bit-width neural network model;
loading the second bit-width neural network model on a second processor of the mobile device implementing the second bit-width architecture,
wherein:
the second bit-width architecture includes a smaller bit-width than the first bit-width architecture; and
executing the second bit-width neural network model on the second processor of the mobile device generates an inference result used to control a function of the mobile device associated with the second processor.
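The equalization and quantization steps recited in claim 1 can be sketched in NumPy. This is a minimal illustration, not the patented implementation: the function names (`cross_layer_equalize`, `quantize_int8`), the geometric-mean scaling factors, and the symmetric 8-bit quantization scheme are all assumptions for the example. Scaling each output channel of the first layer by s_i and the matching input channel of the adjacent layer by 1/s_i leaves the composed network function unchanged while equalizing the per-channel weight ranges, which reduces quantization error at the smaller bit-width.

```python
import numpy as np

def cross_layer_equalize(w1, w2):
    """Cross-layer range equalization across two adjacent layers.

    w1: first-layer weights, shape (out_channels, in_channels)
    w2: adjacent-layer weights, shape (out_channels2, out_channels)

    Output channel i of w1 is scaled by s_i and the corresponding
    input channel i of w2 by 1/s_i, so w2 @ (w1 @ x) is unchanged.
    """
    r1 = np.abs(w1).max(axis=1)   # per-output-channel range of layer 1
    r2 = np.abs(w2).max(axis=0)   # per-input-channel range of layer 2
    s = np.sqrt(r1 * r2) / r1     # scaling factor per channel (assumed choice)
    w1_eq = w1 * s[:, None]       # scale each output channel weight
    w2_eq = w2 / s[None, :]       # apply inverse scale to input channel weights
    return w1_eq, w2_eq, s

def quantize_int8(w):
    """Symmetric per-tensor quantization to a smaller (8-bit) bit-width."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy first-bit-width (float) model with mismatched channel ranges.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 8)) * np.array([0.1, 1.0, 10.0, 0.5])[:, None]
w2 = rng.normal(size=(3, 4))

w1_eq, w2_eq, s = cross_layer_equalize(w1, w2)
q1, scale1 = quantize_int8(w1_eq)
q2, scale2 = quantize_int8(w2_eq)
```

After equalization, each output-channel range of `w1_eq` equals the corresponding input-channel range of `w2_eq` (their geometric mean), so a single quantization grid covers both layers with less clipping or resolution loss than the original, unbalanced weights would allow.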