| CPC G10L 15/063 (2013.01) [G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 2015/0636 (2013.01); G10L 15/075 (2013.01); G10L 15/1815 (2013.01)] | 14 Claims |

|
1. A computer-implemented method for training a speech recognition model to recognize long-form speech, the computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
obtaining a set of training samples, each training sample in the set of training samples comprising a corresponding sequence of speech segments corresponding to a training utterance and a corresponding sequence of ground-truth transcriptions for the sequence of speech segments, wherein each ground-truth transcription in the corresponding sequence of ground-truth transcriptions comprises a start time and an end time of a corresponding speech segment; and
for each training sample in the set of training samples:
processing, using the speech recognition model, the corresponding sequence of speech segments to obtain an N-best list of speech recognition hypotheses for the training utterance;
for each speech recognition hypothesis in the N-best list of speech recognition hypotheses, identifying a respective number of word errors relative to the corresponding sequence of ground-truth transcriptions;
computing an average of the respective number of word errors identified for each speech recognition hypothesis in the N-best list of speech recognition hypotheses; and
training the speech recognition model to minimize word error rate based on the computed average of the respective number of word errors in the N-best list of speech recognition hypotheses by, for each speech recognition hypothesis in the N-best list of speech recognition hypotheses, one of:
increasing a probability of the speech recognition hypothesis when the respective number of word errors is less than the computed average; or
decreasing the probability of the speech recognition hypothesis when the respective number of word errors is greater than the computed average.
|
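The per-sample steps recited in claim 1 (counting word errors for each hypothesis, averaging them over the N-best list, and comparing each hypothesis against that average) can be illustrated with a minimal sketch. It is not the claimed implementation, and it rests on several assumptions. The names `word_errors` and `update_directions` are hypothetical. Word errors are counted as a word-level edit distance against the concatenated ground-truth transcriptions. The start and end times of the segments are omitted. A hypothesis whose error count equals the average is reported as "unchanged", a case the claim does not address. The sketch only reports the direction in which each hypothesis probability would move; the actual parameter update of the speech recognition model is not shown.

```python
def word_errors(hypothesis, reference):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))  # distances for an empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # extra word in the hypothesis
                curr[j - 1] + 1,         # word missing from the hypothesis
                prev[j - 1] + (h != r),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def update_directions(n_best, ground_truth_segments):
    """Return per-hypothesis error counts, their average, and update directions."""
    # Assumption: the reference for the whole utterance is the concatenation
    # of the per-segment ground-truth transcriptions.
    reference = " ".join(ground_truth_segments)
    errors = [word_errors(h, reference) for h in n_best]
    average = sum(errors) / len(errors)
    directions = []
    for e in errors:
        if e < average:
            directions.append("increase")   # fewer errors than average
        elif e > average:
            directions.append("decrease")   # more errors than average
        else:
            directions.append("unchanged")  # assumption: tie case left as-is
    return errors, average, directions


# Hypothetical two-segment training utterance and a 3-best list.
segments = ["turn on the lights", "and play some music"]
n_best = [
    "turn on the lights and play some music",
    "turn on the light and play some music",
    "turn on the light and play sum music",
]
errors, average, directions = update_directions(n_best, segments)
print(errors, average, directions)
```

In this hypothetical example the three hypotheses contain 0, 1, and 2 word errors, so the average is 1.0: the first hypothesis would have its probability increased and the third would have its probability decreased.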