CPC G06N 3/08 (2013.01) [G06F 7/483 (2013.01); G06N 3/04 (2013.01)] | 20 Claims |
1. A method of training a neural network that has a batch normalized first neural network layer, the method comprising:
during the training, maintaining (i) moving averages of batch normalization statistics for the batch normalized first neural network layer and (ii) floating point weights for the batch normalized first neural network layer;
receiving a first batch of training data;
processing the first batch of training data through one or more layers of the neural network to generate a batch of layer inputs to the batch normalized first neural network layer;
determining batch normalization statistics for the first batch of training data using the floating point weights, comprising:
performing operations of the batch normalized first neural network layer on the batch of layer inputs using the floating point weights to generate a batch of initial layer outputs; and
computing batch normalization statistics based on the batch of initial layer outputs;
determining a correction factor from the batch normalization statistics for the first batch of training data and the moving averages of the batch normalization statistics;
generating batch normalized weights from the floating point weights for the batch normalized first neural network layer using the correction factor, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer;
quantizing the batch normalized weights to generate quantized batch normalized weights;
after quantizing the batch normalized weights, performing operations of the batch normalized first neural network layer on the batch of layer inputs using the quantized batch normalized weights to generate a batch of layer outputs, comprising:
applying the quantized batch normalized weights to each of the batch of layer inputs to generate a respective initial output for each layer input in the batch of layer inputs;
determining whether the training of the neural network has reached a first threshold;
in response to determining that the training of the neural network has reached the first threshold, switching to using a long-term moving average of the batch normalization statistics by refraining from dividing each respective initial output by the correction factor to introduce a dependence on the batch normalization statistics;
determining, using the batch of layer outputs, a gradient of an objective function with respect to the quantized batch normalized weights for the first neural network layer; and
updating the floating point weights of the first neural network layer using the gradient with respect to the quantized batch normalized weights.
|