US 12,254,865 B2
Multi-dialect and multilingual speech recognition
Zhifeng Chen, Sunnyvale, CA (US); Bo Li, Santa Clara, CA (US); Eugene Weinstein, New York, NY (US); Yonghui Wu, Fremont, CA (US); Pedro J. Moreno Mengibar, Jersey City, NJ (US); Ron J. Weiss, New York, NY (US); Khe Chai Sim, Dublin, CA (US); Tara N. Sainath, Jersey City, NJ (US); and Patrick An Phu Nguyen, Palo Alto, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 20, 2024, as Appl. No. 18/418,246.
Application 18/418,246 is a continuation of application No. 17/572,238, filed on Jan. 10, 2022, granted, now Pat. No. 11,900,915.
Application 17/572,238 is a continuation of application No. 16/684,483, filed on Nov. 14, 2019, granted, now Pat. No. 11,238,845, issued on Feb. 1, 2022.
Claims priority of provisional application 62/770,534, filed on Nov. 21, 2018.
Prior Publication US 2024/0161732 A1, May 16, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/00 (2013.01); G10L 15/06 (2013.01); G10L 15/07 (2013.01); G10L 15/16 (2006.01)
CPC G10L 15/005 (2013.01) [G10L 15/07 (2013.01); G10L 15/16 (2013.01); G10L 2015/0631 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A computer-implemented method of jointly performing speech recognition and language prediction using a sequence-to-sequence speech recognition model, the method, when executed on data processing hardware, causing the data processing hardware to perform operations comprising:
receiving audio data characterizing a spoken utterance;
processing, using the sequence-to-sequence speech recognition model, the audio data to generate, at each of a plurality of time steps:
a probability distribution over a predetermined set of linguistic units; and
a predicted language of the spoken utterance among multiple different languages the speech recognition model has been trained to recognize; and
providing, as an output from the sequence-to-sequence speech recognition model, a transcription of the utterance based on the probability distribution over the predetermined set of linguistic units and the predicted language generated at each of the plurality of time steps,
wherein the speech recognition model is trained using multi-task learning using:
a first objective function corresponding to grapheme prediction; and
a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.
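For readers unfamiliar with this style of multi-task training, the sketch below illustrates, in rough form, the arrangement recited in claim 1: a shared encoder whose per-time-step hidden representations feed both a grapheme-prediction head and a language/dialect-classification head, with the two objective functions combined under fixed weights. This is a minimal illustration only, not the patented attention-based sequence-to-sequence model; the layer sizes, head names, per-frame cross-entropy losses, and the 0.9/0.1 weighting are assumptions made for the example.

    # Minimal PyTorch sketch (illustrative, not the patented model): a shared
    # encoder with two per-time-step output heads, trained with a weighted
    # multi-task loss combining a grapheme prediction cost and a
    # language/dialect classification cost.
    import torch
    import torch.nn as nn


    class JointASRLangID(nn.Module):
        def __init__(self, n_mels=80, hidden=256, n_graphemes=75, n_languages=9):
            super().__init__()
            # Shared encoder over log-mel frames; its hidden states must serve
            # both tasks, which is what the weighted loss encourages.
            self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
            self.grapheme_head = nn.Linear(hidden, n_graphemes)  # grapheme logits per step
            self.language_head = nn.Linear(hidden, n_languages)  # language logits per step

        def forward(self, features):                 # features: (batch, time, n_mels)
            states, _ = self.encoder(features)       # (batch, time, hidden)
            return self.grapheme_head(states), self.language_head(states)


    def multitask_loss(grapheme_logits, language_logits, grapheme_targets,
                       language_target, w_grapheme=0.9, w_language=0.1):
        """Weighted sum of the two objective functions (weights are illustrative)."""
        ce = nn.CrossEntropyLoss()
        # Grapheme prediction cost (a per-frame stand-in for the decoder loss).
        l_grapheme = ce(grapheme_logits.flatten(0, 1), grapheme_targets.flatten())
        # Language/dialect classification cost, applied at every time step by
        # repeating the utterance-level language label across frames.
        frame_lang = language_target.unsqueeze(1).expand(-1, language_logits.size(1))
        l_language = ce(language_logits.flatten(0, 1), frame_lang.flatten())
        return w_grapheme * l_grapheme + w_language * l_language


    # Toy invocation under the same assumptions.
    model = JointASRLangID()
    feats = torch.randn(4, 120, 80)             # 4 utterances, 120 frames of log-mel features
    graphemes = torch.randint(0, 75, (4, 120))  # per-frame grapheme targets (illustrative)
    langs = torch.randint(0, 9, (4,))           # one language label per utterance
    grapheme_logits, language_logits = model(feats)
    multitask_loss(grapheme_logits, language_logits, graphemes, langs).backward()

Because both heads read the same encoder states, minimizing the combined cost pushes the shared hidden representations to be useful for grapheme prediction and for language or dialect classification at once, which is the effect recited in the final limitation of claim 1.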