US 11,996,088 B2
Setting latency constraints for acoustic models
Andrew W. Senior, New York, NY (US); Hasim Sak, Santa Clara, CA (US); and Kanury Kanishka Rao, Santa Clara, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jul. 1, 2020, as Appl. No. 16/918,669.
Application 16/918,669 is a continuation of application No. 14/879,225, filed on Oct. 9, 2015, granted, now Pat. No. 10,733,979.
Prior Publication US 2020/0335093 A1, Oct. 22, 2020
Int. Cl. G10L 15/16 (2006.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01)
CPC G10L 15/16 (2013.01) [G06N 3/044 (2023.01); G06N 3/045 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A method of training a neural network, the method comprising:
receiving, at data processing hardware, training data comprising training audio data for an utterance and a sequence of phone labels identifying respective phones that occur in the utterance;
generating, by the data processing hardware, using an acoustic model, a reference alignment indicating the ground truth audio frames at which the phone labels in the sequence of phone labels occur in a sequence of audio frames representing the training audio data;
for each phone label, setting, by the data processing hardware, a respective constrained range of audio frames in the sequence of audio frames that ends a predetermined number of audio frames after the respective ground truth ending audio frame in which the respective phone label last occurs in the reference alignment;
providing, by the data processing hardware, the training audio data to the neural network executing on the data processing hardware, the neural network configured to:
receive, as input, each audio frame in the sequence of audio frames representing the training audio data; and
generate, as output, a sequence of output labels indicating the occurrence of each respective phone;
determining, by the data processing hardware, for each particular output label of the sequence of output labels, a corresponding delay between output of the particular output label and the corresponding constrained range of audio frames; and
updating, by the data processing hardware, using the sequence of output labels generated and the corresponding delays, parameters of the neural network by applying a penalty when a corresponding delay satisfies a threshold constraint.
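
For illustration only, the sketch below (in PyTorch, which the patent does not name) shows one way the training loop of claim 1 could be approximated: a frame-level acoustic model is trained with cross-entropy, and an added latency term penalizes the posterior mass that each phone label places on frames later than its constrained range (its ground-truth ending frame plus a predetermined lag). All identifiers (AcousticLSTM, latency_penalty, allowed_lag, the 0.1 penalty weight) are hypothetical, and the soft, differentiable penalty stands in for the claim's delay-threshold test because an argmax emission time is not differentiable; it is not the patented implementation.

    # Illustrative sketch only, not the patented method. Assumes a frame-
    # synchronous acoustic model, a precomputed reference alignment giving
    # each phone's ground-truth ending frame, and a hypothetical allowed_lag
    # hyperparameter (the claim's "predetermined number of audio frames").
    import torch
    import torch.nn as nn

    class AcousticLSTM(nn.Module):
        """Minimal frame-level acoustic model: LSTM + log-softmax over phones."""
        def __init__(self, feat_dim: int, num_phones: int, hidden: int = 128):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, num_phones)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            out, _ = self.lstm(frames)               # (B, T, hidden)
            return self.proj(out).log_softmax(-1)    # (B, T, num_phones)

    def latency_penalty(log_probs, phone_ids, ref_end_frames, allowed_lag):
        """Penalize probability mass a phone emits after its constrained range
        (reference ending frame + allowed_lag); a differentiable stand-in for
        'delay satisfies a threshold constraint'."""
        T = log_probs.size(1)
        frame_idx = torch.arange(T, device=log_probs.device)
        penalty = log_probs.new_zeros(())
        for phone, ref_end in zip(phone_ids, ref_end_frames):
            late = frame_idx > (ref_end + allowed_lag)   # frames past the range
            if late.any():
                penalty = penalty + log_probs[0, late, phone].exp().sum()
        return penalty

    # Toy usage: one utterance, random features, a three-phone alignment.
    torch.manual_seed(0)
    feat_dim, num_phones, T = 40, 10, 50
    model = AcousticLSTM(feat_dim, num_phones)
    frames = torch.randn(1, T, feat_dim)

    frame_targets = torch.randint(0, num_phones, (1, T))  # per-frame phone labels
    phone_ids = [3, 7, 1]                                  # phones in the utterance
    ref_end_frames = [12, 30, 49]                          # ground-truth ending frames
    allowed_lag = 5                                        # predetermined frame count

    log_probs = model(frames)
    ce = nn.NLLLoss()(log_probs.transpose(1, 2), frame_targets)
    loss = ce + 0.1 * latency_penalty(log_probs, phone_ids, ref_end_frames, allowed_lag)
    loss.backward()   # gradients would then drive the parameter update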