US 12,443,848 B2
Neural network activation compression with narrow block floating-point
Daniel Lo, Bothell, WA (US); Amar Phanishayee, Seattle, WA (US); Eric S. Chung, Woodinville, WA (US); Yiren Zhao, Cambridge (GB); and Ritchie Zhao, Ithaca, NY (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Dec. 31, 2018, as Appl. No. 16/237,197.
Prior Publication US 2020/0210838 A1, Jul. 2, 2020
Int. Cl. G06N 3/084 (2023.01); G06F 7/499 (2006.01); G06F 9/30 (2018.01); G06F 9/50 (2006.01); G06N 5/046 (2023.01); G06N 20/00 (2019.01)
CPC G06N 3/084 (2013.01) [G06F 7/49915 (2013.01); G06F 9/30025 (2013.01); G06F 9/5027 (2013.01); G06N 5/046 (2013.01); G06N 20/00 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A computing system comprising:
one or more hardware processors;
at least one memory coupled to the one or more hardware processors; and
one or more computer-readable storage media storing computer-executable instructions, or hardware comprising logic implementing the computer-executable instructions, that, when executed by the computing system, cause the computing system to perform operations comprising:
for each of multiple layers of a neural network comprising a plurality of layers:
performing forward propagation for a given layer of the multiple layers of the neural network using a set of input data to produce activation values in a first quantized block floating-point format for the given layer, the first quantized block floating-point format having a first numerical precision;
converting at least one of the produced activation values in the first quantized block floating-point format for the given layer to a second, different, quantized block floating-point format to produce compressed activation values for the given layer, the second quantized block floating-point format having a second numerical precision less than the first numerical precision;
storing the compressed activation values to provide stored compressed activation values for the given layer; and
for layers of the multiple layers that are not a final layer of the neural network, forward propagating the activation values in the first quantized block floating-point format for the given layer to a next layer of the multiple layers;
calculating a measure of loss for an initial set of input data provided to the neural network based on results of forward propagation through the multiple layers using the activation values in the first quantized block floating-point format provided by an output layer of the neural network, wherein the output layer receives a set of input values of a prior layer of the multiple layers in the first quantized block floating-point format;
retrieving the stored compressed activation values for the multiple layers;
decompressing the stored compressed activation values for given layers of the multiple layers from the second quantized block floating-point format to the first quantized block floating-point format to provide decompressed activation values for the given layers of the multiple layers; and
performing backpropagation for the multiple layers of the neural network using the measure of loss and respective decompressed activation values for respective given layers of the multiple layers.
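The quantize-compress-decompress cycle recited in the claim can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names, the sign-magnitude mantissa encoding, the choice of shared exponent (floor of log2 of the block maximum), and the 8-bit/4-bit mantissa widths standing in for the first and second quantized block floating-point formats are all assumptions for demonstration.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits):
    """Quantize a block of floats to block floating-point: one shared
    exponent for the whole block plus a narrow signed mantissa per value.
    (Illustrative encoding only; not the patent's format.)"""
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return 0, np.zeros(block.shape, dtype=np.int32)
    shared_exp = int(np.floor(np.log2(max_abs)))
    # Scale chosen so the largest value maps near the top of the
    # signed mantissa range [-(2^(b-1)-1), 2^(b-1)-1].
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), -limit, limit).astype(np.int32)
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate floats from a shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    return mantissas.astype(np.float64) * scale

# Forward propagation: activations held in a wider (8-bit mantissa) format.
activations = np.array([0.5, -1.25, 3.0, 0.125])
exp8, m8 = bfp_quantize(activations, mantissa_bits=8)

# Compression for storage: convert to a narrower (4-bit mantissa) format.
exp4, m4 = bfp_quantize(bfp_dequantize(exp8, m8, 8), mantissa_bits=4)

# Backpropagation: decompress the stored narrow-format values.
decompressed = bfp_dequantize(exp4, m4, 4)
```

Because every value in a block shares one exponent, the per-value storage cost is dominated by the mantissa width, so dropping from 8 to 4 mantissa bits roughly halves activation memory; the reconstruction error is bounded by half the 4-bit scale factor for in-range values.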