US 12,254,875 B2
Multilingual re-scoring models for automatic speech recognition
Neeraj Gaur, Mountain View, CA (US); Tongzhou Chen, Mountain View, CA (US); Ehsan Variani, Mountain View, CA (US); Bhuvana Ramabhadran, Mt. Kisco, NY (US); Parisa Haghani, Mountain View, CA (US); and Pedro J. Moreno Mengibar, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Feb. 27, 2024, as Appl. No. 18/589,220.
Application 18/589,220 is a continuation of application No. 17/701,635, filed on Mar. 22, 2022, granted, now 12,080,283.
Claims priority of provisional application 63/166,916, filed on Mar. 26, 2021.
Prior Publication US 2024/0203409 A1, Jun. 20, 2024
Int. Cl. G10L 15/197 (2013.01); G10L 15/00 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)

CPC G10L 15/197 (2013.01) [G10L 15/005 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01)]

20 Claims

1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:

receiving transcribed audio training data comprising training audio data corresponding to an utterance paired with a ground-truth transcription of the utterance;

during a first pass, processing, using a speech recognition model, the training audio data to generate N candidate hypotheses for the utterance, each corresponding candidate hypothesis among the N candidate hypotheses having a respective first pass score;

during a second pass, for each corresponding candidate hypothesis of the N candidate hypotheses:

generating, using a neural network rescoring model, a respective second pass score based on the respective first pass score for the corresponding candidate hypothesis; and

applying a Softmax function to a respective negative edit-distance between the corresponding candidate hypothesis and the ground-truth transcription; and

optimizing model parameters of the neural network rescoring model based on the Softmax function applied to the respective negative edit-distance between the ground-truth transcription and each corresponding candidate hypothesis among the N candidate hypotheses.
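
The training signal recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the claim does not say how the Softmax over negative edit-distances enters the objective, so the cross-entropy loss below is an assumed choice, as is the word-level edit-distance. The function names (edit_distance, target_distribution, rescoring_loss), the example utterance, and the second pass scores are hypothetical; in practice the second pass scores would come from the neural network rescoring model, which takes the first pass scores as input.

```python
import math


def edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (word level, an assumption)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def softmax(values):
    """Numerically stable Softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def target_distribution(hypotheses, reference):
    """Softmax applied to the negative edit-distance of each hypothesis."""
    ref_tokens = reference.split()
    neg_dists = [-edit_distance(h.split(), ref_tokens) for h in hypotheses]
    return softmax(neg_dists)


def rescoring_loss(second_pass_scores, targets):
    """Assumed objective: cross-entropy between the target distribution and
    the Softmax of the second pass scores over the N hypotheses."""
    predicted = softmax(second_pass_scores)
    return -sum(t * math.log(p) for t, p in zip(targets, predicted))


# Hypothetical N = 3 candidate list for one training utterance.
reference = "play some jazz music"
hypotheses = [
    "play some jazz music",
    "play some jazz musing",
    "lay sum jazz music",
]
# Placeholder second pass scores; a real rescoring model would produce these.
second_pass_scores = [-1.2, -0.9, -3.0]

targets = target_distribution(hypotheses, reference)
loss = rescoring_loss(second_pass_scores, targets)
```

In this sketch the hypothesis that matches the ground-truth transcription receives the largest target weight, and the weight falls as the edit-distance grows. Minimizing the assumed loss with respect to the rescoring model parameters would push the second pass scores toward the same ordering.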