US 12,136,034 B2
Dynamic gradient aggregation for training neural networks
Dimitrios B. Dimitriadis, Bellevue, WA (US); Kenichi Kumatani, Sammamish, WA (US); Robert Peter Gmyr, Bellevue, WA (US); Masaki Itagaki, Redmond, WA (US); Yashesh Gaur, Redmond, WA (US); Nanshan Zeng, Bellevue, WA (US); and Xuedong Huang, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jul. 31, 2020, as Appl. No. 16/945,715.
Prior Publication US 2022/0036178 A1, Feb. 3, 2022
Int. Cl. G06N 3/08 (2023.01); G06N 3/04 (2023.01)

CPC G06N 3/08 (2013.01) [G06N 3/04 (2013.01)]

20 Claims

1. A system for training an automated speech recognition (ASR) neural network model, the system comprising:

at least one processor; and

at least one memory comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to:

apply the ASR neural network model to each data set of a plurality of data sets representing speech data;

generate a plurality of gradients based on applying the ASR neural network model to each data set of the plurality of data sets, wherein an individual gradient of the plurality of gradients is generated based on an individual data set of the plurality of data sets;

determine, prior to updating the ASR neural network model, a plurality of gradient quality metrics for the plurality of gradients;

calculate, prior to updating the ASR neural network model, initial weight factors for the plurality of gradients based on the plurality of gradient quality metrics such that gradients generated from undistorted speech are assigned higher initial weight factors than gradients generated from distorted speech in advance of initial training processes associated with the plurality of data sets;

transform, prior to updating the ASR neural network model, the plurality of gradients into a plurality of weighted gradients based on the initial weight factors;

generate, prior to updating the ASR neural network model, a global gradient based on an aggregation of the plurality of weighted gradients; and

update the ASR neural network model based on the global gradient such that the resulting updated ASR neural network model is based on the initial training processes associated with the plurality of data sets, wherein the updated ASR neural network model, when applied to the individual data set, performs a task for ASR based on the individual data set and provides model output based on performing the task.
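
The sequence of steps recited in claim 1 (per-data-set gradients, quality metrics, weight factors, weighted gradients, global gradient, model update) can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not say how a gradient quality metric is computed or how it maps to a weight factor, so the sketch assumes a hypothetical scalar quality score per data set (higher meaning cleaner speech) and assumes a softmax to turn scores into weights. The function names, the toy parameter vector, and the learning rate are also illustrative assumptions.

```python
import math


def softmax(scores):
    """Map arbitrary scores to positive weights that sum to 1 (assumed mapping)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_gradients(gradients, quality_metrics):
    """Weight each per-data-set gradient by its quality and sum the results.

    gradients: one flat gradient vector (list of floats) per data set.
    quality_metrics: one hypothetical scalar score per gradient, higher is better.
    Returns (weight_factors, global_gradient).
    """
    weight_factors = softmax(quality_metrics)
    dim = len(gradients[0])
    global_gradient = [0.0] * dim
    for weight, gradient in zip(weight_factors, gradients):
        for i in range(dim):
            global_gradient[i] += weight * gradient[i]
    return weight_factors, global_gradient


def update_model(params, global_gradient, learning_rate=0.1):
    """Apply one plain gradient-descent step using the aggregated gradient."""
    return [p - learning_rate * g for p, g in zip(params, global_gradient)]


# Toy example: two data sets, two model parameters.
params = [1.0, -2.0]
gradients = [
    [0.5, -0.5],  # gradient from a data set assumed to hold undistorted speech
    [4.0, 3.0],   # gradient from a data set assumed to hold distorted speech
]
quality_metrics = [2.0, 0.0]  # assumed scores: the clean data set scores higher

weights, global_gradient = aggregate_gradients(gradients, quality_metrics)
new_params = update_model(params, global_gradient)
```

In this sketch the weights and the global gradient are both computed before `update_model` is called, matching the "prior to updating" ordering in the claim, and the gradient assumed to come from undistorted speech receives the larger weight.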