US 11,922,932 B2
Minimum word error rate training for attention-based sequence-to-sequence models
Rohit Prakash Prabhavalkar, Palo Alto, CA (US); Tara N. Sainath, Jersey City, NJ (US); Yonghui Wu, Fremont, CA (US); Patrick An Phu Nguyen, Mountain View, CA (US); Zhifeng Chen, Sunnyvale, CA (US); Chung-Cheng Chiu, Sunnyvale, CA (US); and Anjuli Patricia Kannan, Berkeley, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 31, 2023, as Appl. No. 18/194,586.
Application 18/194,586 is a continuation of application No. 17/443,557, filed on Jul. 27, 2021, granted, now 11,646,019.
Application 17/443,557 is a continuation of application No. 16/529,252, filed on Aug. 1, 2019, granted, now 11,107,463, issued on Aug. 31, 2021.
Claims priority of provisional application 62/713,332, filed on Aug. 1, 2018.
Prior Publication US 2023/0237995 A1, Jul. 27, 2023
Int. Cl. G10L 15/197 (2013.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/197 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 2015/025 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving a sequence of feature vectors indicative of acoustic characteristics of a training utterance;
receiving a ground-truth label sequence corresponding to the training utterance; and
training a speech recognition model to minimize word error rate by performing operations comprising:
processing, using the speech recognition model, the sequence of feature vectors to obtain a set of speech recognition hypothesis samples for the training utterance in a beam search;
for each speech recognition hypothesis sample in the set of speech recognition hypothesis samples, identifying a respective number of word errors relative to the ground-truth label sequence corresponding to the training utterance; and
approximating a loss function based on a combination of the respective numbers of word errors identified for each speech recognition hypothesis sample in the set of speech recognition hypothesis samples.
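
The training operations recited above approximate the expected number of word errors over a set of hypotheses obtained by beam search. What follows is a minimal sketch, in plain Python, of one way such a loss could be computed, assuming each hypothesis arrives with a model log-probability; the names word_errors and mwer_loss, and the subtraction of the mean error count as a baseline, are illustrative assumptions rather than language from the patent.

    # Hypothetical sketch of a minimum word error rate (MWER) loss over an
    # N-best list; function and variable names are illustrative only.
    import math
    from typing import List, Tuple

    def word_errors(hypothesis: List[str], reference: List[str]) -> int:
        """Count word errors (substitutions, insertions, deletions) by edit distance."""
        rows, cols = len(reference) + 1, len(hypothesis) + 1
        dist = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            dist[i][0] = i
        for j in range(cols):
            dist[0][j] = j
        for i in range(1, rows):
            for j in range(1, cols):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + cost) # substitution
        return dist[-1][-1]

    def mwer_loss(nbest: List[Tuple[List[str], float]], reference: List[str]) -> float:
        """Approximate the expected word error count over an N-best list.

        nbest holds (hypothesis, log_prob) pairs produced by beam search; the
        scores are renormalized over the list so the weights sum to one.
        """
        errors = [word_errors(hyp, reference) for hyp, _ in nbest]
        log_probs = [lp for _, lp in nbest]
        max_lp = max(log_probs)
        exp_scores = [math.exp(lp - max_lp) for lp in log_probs]
        total = sum(exp_scores)
        probs = [s / total for s in exp_scores]
        # Subtracting the mean error count is a common variance-reduction baseline.
        mean_err = sum(errors) / len(errors)
        return sum(p * (e - mean_err) for p, e in zip(probs, errors))

    # Example usage with a toy two-hypothesis beam:
    # loss = mwer_loss([("the cat sat".split(), -1.2), ("a cat sat".split(), -2.3)],
    #                  reference="the cat sat".split())

Renormalizing the scores over the N-best list keeps the approximation a proper distribution over the sampled hypotheses, which is why the per-hypothesis weights in the sketch sum to one before being combined with the word-error counts.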