US 11,681,905 B2
Hardware-assisted gradient optimization using streamed gradients
Jinwen Xi, Sunnyvale, CA (US); Bharadwaj Pudipeddi, San Jose, CA (US); and Marc Tremblay, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Mar. 23, 2020, as Appl. No. 16/827,367.
Prior Publication US 2021/0295141 A1, Sep. 23, 2021
Int. Cl. G06N 3/06 (2006.01); G06N 3/063 (2023.01); G06N 20/00 (2019.01); G06N 3/084 (2023.01); G06N 5/046 (2023.01); G11C 11/34 (2006.01)
CPC G06N 3/063 (2013.01) [G06N 3/084 (2013.01); G06N 5/046 (2013.01); G06N 20/00 (2019.01); G11C 11/34 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method in a system comprising a memory configured to store weights associated with a neural network model comprising L layers, wherein L is an integer greater than one, a gradient optimizer, and a plurality of workers, wherein each of the plurality of workers is configured to perform a forward pass and a backward pass on any one of the L layers associated with the neural network model, the method comprising:
during a single burst cycle, receiving gradients from each of the plurality of workers into a predetermined number of gradient buffers;
during the single burst cycle, providing the received gradients to a reduction block to generate reduced gradients;
during the single burst cycle, providing the reduced gradients to a gradient optimizer data path associated with the gradient optimizer;
during the single burst cycle, moving weights from at least one buffer, coupled to the memory, to the gradient optimizer;
during the single burst cycle, writing back new weights, calculated by the gradient optimizer, to the memory;
during the single burst cycle, transmitting the new weights, from the gradient optimizer, to each of the plurality of workers, wherein during the single burst cycle the gradient optimizer operates on a gradient burst having a fixed number of gradients; and
during each of successive burst cycles, performing operations such that there is at least a partial overlap among operations related to: (1) receiving gradients into the predetermined number of gradient buffers, (2) providing received gradients to the reduction block, (3) providing reduced gradients to the gradient optimizer data path, and (4) writing back new weights to the memory.
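
The following is a minimal software sketch of the per-burst dataflow recited in claim 1, intended only as an illustration of the claimed steps: receiving a fixed-size burst of gradients from each worker into gradient buffers, reducing them, streaming the matching weights to the optimizer, writing the new weights back to memory, and transmitting them to the workers. The burst size, worker count, mean reduction, plain SGD update, and all function and variable names here are assumptions for illustration and are not taken from the patent; in the claimed hardware the numbered steps also overlap (pipeline) across successive burst cycles, which this sequential sketch does not model.

```python
import numpy as np

# Illustrative parameters (not from the patent).
BURST_SIZE = 256       # fixed number of gradients per "gradient burst"
NUM_WORKERS = 4        # plurality of workers
MODEL_SIZE = 1024      # total number of weights streamed through the optimizer
LEARNING_RATE = 0.01

def reduce_gradients(gradient_buffers):
    """Reduction block: combine one burst of gradients from every worker.

    A mean across workers is assumed here; the claim only requires that the
    received gradients be reduced into a single stream of reduced gradients.
    """
    return np.mean(gradient_buffers, axis=0)

def optimizer_step(weights_burst, reduced_gradients, lr=LEARNING_RATE):
    """Gradient optimizer data path: plain SGD is assumed for illustration."""
    return weights_burst - lr * reduced_gradients

def run_burst_cycles(weight_memory, worker_gradients):
    """Stream the model through the optimizer one fixed-size burst at a time.

    weight_memory:    (MODEL_SIZE,) array standing in for the weight memory.
    worker_gradients: (NUM_WORKERS, MODEL_SIZE) gradients from the workers'
                      backward passes.
    """
    for start in range(0, MODEL_SIZE, BURST_SIZE):
        end = start + BURST_SIZE
        # (1) Receive one burst of gradients from each worker into gradient buffers.
        gradient_buffers = worker_gradients[:, start:end]
        # (2) Reduction block produces reduced gradients for this burst.
        reduced = reduce_gradients(gradient_buffers)
        # (3) Move the matching burst of weights from memory to the optimizer.
        weights_burst = weight_memory[start:end]
        new_weights = optimizer_step(weights_burst, reduced)
        # (4) Write the new weights back to memory.
        weight_memory[start:end] = new_weights
        # (5) Transmit the new weights to each worker (modeled as a yielded copy).
        yield new_weights.copy()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(MODEL_SIZE)
    grads = rng.standard_normal((NUM_WORKERS, MODEL_SIZE))
    bursts_sent_to_workers = list(run_burst_cycles(weights, grads))
    print(f"processed {len(bursts_sent_to_workers)} burst cycles")
```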