US 12,242,956 B2
Systems and methods of cross layer rescaling for improved quantization performance
Markus Nagel, Amsterdam (NL); Marinus Willem van Baalen, Amsterdam (NL); and Tijmen Pieter Frederik Blankevoort, Amsterdam (NL)
Assigned to QUALCOMM Incorporated, San Diego, CA (US)
Filed by QUALCOMM Incorporated, San Diego, CA (US)
Filed on Mar. 23, 2020, as Appl. No. 16/826,524.
Claims priority of provisional application 62/822,254, filed on Mar. 22, 2019.
Prior Publication US 2020/0302299 A1, Sep. 24, 2020
Int. Cl. G06N 3/08 (2023.01); G06N 3/04 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/04 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method of quantizing a neural network, comprising:
loading a first bit-width neural network model of the neural network on a first processor of a mobile device implementing a first bit-width architecture;
performing cross-layer range equalization across at least a first layer and a plurality of second adjacent layers in the first bit-width neural network model by:
scaling each output channel weight of the first layer of the first bit-width neural network model by a corresponding scaling factor in the first bit-width neural network model and generating a first layer of a first bit-width scaled neural network model; and
scaling each of the second adjacent layers' corresponding input channel weights by applying an inverse of the corresponding scaling factor to the input channel weights in the first bit-width neural network model and generating each of a plurality of second adjacent layers of the first bit-width scaled neural network model;
quantizing the output channel weights and corresponding input channel weights from the first bit-width scaled neural network model to a second bit-width architecture and generating a second bit-width neural network model;
loading the second bit-width neural network model on a second processor of the mobile device implementing the second bit-width architecture,
wherein:
the second bit-width architecture includes a smaller bit-width than the first bit-width architecture; and
executing the second bit-width neural network model on the second processor of the mobile device generates an inference result used to control a function of the mobile device associated with the second processor.
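
A minimal NumPy sketch of the cross-layer range equalization recited in the claim, for one pair of adjacent fully connected layers. The helper name equalize_pair, the epsilon guard, and the scale choice s_i = sqrt(r2_i / r1_i) (which equalizes the two layers' per-channel ranges at their geometric mean, following the cross-layer equalization literature) are illustrative assumptions, not language from the claim. Likewise, the claim recites only weights, but the first layer's bias is scaled here as well: for a positively homogeneous activation such as ReLU, relu(s * x) = s * relu(x) for s > 0, so W2 diag(1/s) relu(diag(s) (W1 x + b1)) = W2 relu(W1 x + b1) and the network function is preserved.

import numpy as np

def equalize_pair(w1, b1, w2, eps=1e-12):
    # w1: (out1, in1) first-layer weights; row i is output channel i.
    # b1: (out1,) first-layer bias.
    # w2: (out2, out1) second-layer weights; column i is the input channel
    #     that consumes output channel i of the first layer.
    r1 = np.abs(w1).max(axis=1)   # per-output-channel range of layer 1
    r2 = np.abs(w2).max(axis=0)   # per-input-channel range of layer 2
    # Assumed scale choice: equalizes r1[i] * s[i] and r2[i] / s[i]
    # at sqrt(r1[i] * r2[i]); eps avoids division by zero.
    s = np.sqrt(np.maximum(r2, eps) / np.maximum(r1, eps))
    w1_scaled = w1 * s[:, None]   # scale each output channel weight by s_i
    b1_scaled = b1 * s            # bias follows its output channel (assumption)
    w2_scaled = w2 / s[None, :]   # apply the inverse of s_i to input channels
    return w1_scaled, b1_scaled, w2_scaled, s

For convolutional layers the same scaling would apply along the output-channel axis of the first kernel and the input-channel axis of the next; where one layer feeds several adjacent layers, as the claim contemplates, the inverse scale would be applied to each consumer's corresponding input channels.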
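The quantization step can be illustrated with uniform symmetric per-tensor quantization to a smaller bit-width (e.g., int8 as the second bit-width architecture). The claim does not prescribe a particular quantization scheme, so the helper names and the symmetric, per-tensor design below are assumptions.

import numpy as np

def quantize_symmetric(w, num_bits=8):
    # Map float weights onto a signed integer grid of num_bits bits.
    # int8 storage below assumes num_bits <= 8.
    qmax = 2 ** (num_bits - 1) - 1                       # e.g. 127 for 8 bits
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)    # full-range symmetric scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for error inspection.
    return q.astype(np.float32) * scale

Because equalization pulls every channel's range toward a common value, a per-tensor grid like this wastes fewer quantization levels on outlier channels, which is the motivation for rescaling before quantizing; the resulting lower bit-width model is what would be loaded and executed on the second processor.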