US 12,482,453 B2
Training for long-form speech recognition
Zhiyun Lu, Brooklyn, NY (US); Thibault Doutre, Mountain View, CA (US); Yanwei Pan, Mountain View, CA (US); Liangliang Cao, Mountain View, CA (US); Rohit Prabhavalkar, Mountain View, CA (US); Trevor Strohman, Mountain View, CA (US); and Chao Zhang, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 27, 2022, as Appl. No. 17/935,924.
Claims priority of provisional application 63/262,137, filed on Oct. 5, 2021.
Prior Publication US 2023/0103382 A1, Apr. 6, 2023
Int. Cl. G10L 15/06 (2013.01); G10L 15/04 (2013.01); G10L 15/16 (2006.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01); G10L 15/07 (2013.01); G10L 15/18 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 2015/0636 (2013.01); G10L 15/075 (2013.01); G10L 15/1815 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A computer-implemented method for training a speech recognition model to recognize long-form speech, the computer-implemented method, when executed on data processing hardware, causing the data processing hardware to perform operations comprising:
obtaining a set of training samples, each training sample in the set of training samples comprising a corresponding sequence of speech segments corresponding to a training utterance and a corresponding sequence of ground-truth transcriptions for the sequence of speech segments, wherein each ground-truth transcription in the corresponding sequence of ground-truth transcriptions comprises a start time and an end time of a corresponding speech segment; and
for each training sample in the set of training samples:
processing, using the speech recognition model, the corresponding sequence of speech segments to obtain an N-best list of speech recognition hypotheses for the training utterance;
for each speech recognition hypothesis in the N-best list of speech recognition hypotheses, identifying a respective number of word errors relative to the corresponding sequence of ground-truth transcriptions;
computing an average of the respective number of word errors identified for each speech recognition hypothesis in the N-best list of speech recognition hypotheses; and
training the speech recognition model to minimize word error rate based on the computed average of the respective number of word errors in the N-best list of speech recognition hypotheses by, for each speech recognition hypothesis in the N-best list of speech recognition hypotheses, one of:
increasing a probability of the speech recognition hypothesis when the respective number of word errors is less than the computed average; or
decreasing the probability of the speech recognition hypothesis when the respective number of word errors is greater than the computed average.
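Claim 1's final step is a minimum word error rate (MWER) criterion over an N-best list. The sketch below, in plain Python, illustrates the arithmetic of that step under stated assumptions: it is not the patent's implementation, the function names (word_errors, mwer_loss) and the toy scores are hypothetical, and a real system would backpropagate this loss through the speech recognition model's parameters rather than evaluate it in isolation.

# Illustrative sketch of the MWER-style objective in claim 1.
# Hypothetical names; the patent does not specify an implementation.
import math

def word_errors(hyp: list[str], ref: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(hyp)][len(ref)]

def mwer_loss(nbest_logprobs: list[float], errors: list[int]) -> float:
    """Expected word errors over the N-best list, with each hypothesis's
    error count centred on the N-best average. Minimizing this raises the
    probability of below-average-error hypotheses and lowers the
    probability of above-average ones, matching the claim's two branches."""
    # Renormalize the model scores over the N-best list (softmax).
    m = max(nbest_logprobs)
    exps = [math.exp(lp - m) for lp in nbest_logprobs]
    z = sum(exps)
    probs = [e / z for e in exps]
    avg_err = sum(errors) / len(errors)
    # Below-average hypotheses contribute negative terms, above-average
    # hypotheses positive terms, so gradient descent shifts probability
    # mass toward hypotheses with fewer word errors.
    return sum(p * (e - avg_err) for p, e in zip(probs, errors))

reference = "the cat sat on the mat".split()
nbest = ["the cat sat on the mat", "the cat sat on a mat", "a cat sat mat"]
logprobs = [-1.2, -1.5, -2.0]  # hypothetical model scores for the N-best list
errors = [word_errors(h.split(), reference) for h in nbest]  # [0, 1, 3]
loss = mwer_loss(logprobs, errors)

Subtracting the N-best average from each hypothesis's error count acts as a baseline: it gives each term the sign the claim describes (negative below average, positive above) without changing which hypotheses the training favors.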