US 12,080,283 B2
Multilingual re-scoring models for automatic speech recognition
Neeraj Gaur, Mountain View, CA (US); Tongzhou Chen, Mountain View, CA (US); Ehsan Variani, Mountain View, CA (US); Bhuvana Ramabhadran, Mt. Kisco, NY (US); Parisa Haghani, Mountain View, CA (US); and Pedro J. Moreno Mengibar, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 22, 2022, as Appl. No. 17/701,635.
Claims priority of provisional application 63/166,916, filed on Mar. 26, 2021.
Prior Publication US 2022/0310081 A1, Sep. 29, 2022
Int. Cl. G10L 15/197 (2013.01); G10L 15/00 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/197 (2013.01) [G10L 15/005 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01)]
14 Claims
 
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance;
receiving a language identifier indicating a language of the utterance;
during a first pass, processing, using a multilingual speech recognition model, the sequence of acoustic frames to generate N candidate hypotheses for the utterance;
during a second pass, for each candidate hypothesis of the N candidate hypotheses:
generating, using a neural oracle search (NOS) model, a respective un-normalized likelihood score based on the sequence of acoustic frames and the corresponding candidate hypothesis;
selecting a language-specific external language model from among a plurality of language-specific external language models each trained on a different respective language;
generating, using the language-specific external language model, a respective external language model score;
generating a standalone score that models prior statistics of the corresponding candidate hypothesis generated during the first pass; and
generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score; and
selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
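
The second-pass re-scoring recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation. The claim says only that the overall score is "based on" the three component scores, so the weighted sum below is an assumption, as are the weights `lm_weight` and `standalone_weight`, the function name `rescore_n_best`, and the use of the language identifier as a dictionary key to select the external language model. The NOS model, the external language models, and the standalone scorer are passed in as hypothetical callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical type aliases for illustration only.
Frames = Sequence[Sequence[float]]           # sequence of acoustic frames
NosScorer = Callable[[Frames, str], float]   # un-normalized likelihood score
TextScorer = Callable[[str], float]          # score computed from text alone


@dataclass
class ScoredHypothesis:
    hypothesis: str
    overall: float


def rescore_n_best(
    frames: Frames,
    language_id: str,
    hypotheses: List[str],
    nos_score: NosScorer,
    external_lms: Dict[str, TextScorer],
    standalone_score: TextScorer,
    lm_weight: float = 1.0,          # assumed interpolation weight
    standalone_weight: float = 1.0,  # assumed interpolation weight
) -> ScoredHypothesis:
    """Re-score the N first-pass hypotheses and return the best one."""
    if not hypotheses:
        raise ValueError("at least one candidate hypothesis is required")

    # Assumption: the language identifier picks the language-specific LM.
    external_lm = external_lms[language_id]

    scored = []
    for hyp in hypotheses:
        s_nos = nos_score(frames, hyp)    # depends on audio and hypothesis
        s_lm = external_lm(hyp)           # language-specific external LM score
        s_prior = standalone_score(hyp)   # prior statistics of the hypothesis
        # Assumption: the overall score is a weighted sum of the three scores.
        overall = s_nos + lm_weight * s_lm + standalone_weight * s_prior
        scored.append(ScoredHypothesis(hyp, overall))

    # The hypothesis with the highest overall score is the final transcription.
    return max(scored, key=lambda s: s.overall)
```

In this sketch the first pass is assumed to have already produced `hypotheses`; a caller would supply its own scoring callables and one external language model per supported language in `external_lms`.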