US 11,900,915 B2
Multi-dialect and multilingual speech recognition
Zhifeng Chen, Sunnyvale, CA (US); Bo Li, Fremont, CA (US); Eugene Weinstein, New York, NY (US); Yonghui Wu, Fremont, CA (US); Pedro J. Moreno Mengibar, Jersey City, NJ (US); Ron J. Weiss, New York, NY (US); Khe Chai Sim, Cupertino, CA (US); Tara N. Sainath, Jersey City, NJ (US); and Patrick An Phu Nguyen, Palo Alto, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 10, 2022, as Appl. No. 17/572,238.
Application 17/572,238 is a continuation of application No. 16/684,483, filed on Nov. 14, 2019, granted, now Pat. No. 11,238,845.
Claims priority of provisional application 62/770,534, filed on Nov. 21, 2018.
Prior Publication US 2022/0130374 A1, Apr. 28, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/00 (2013.01); G10L 15/16 (2006.01); G10L 15/07 (2013.01); G10L 15/06 (2013.01)

CPC G10L 15/005 (2013.01) [G10L 15/07 (2013.01); G10L 15/16 (2013.01); G10L 2015/0631 (2013.01)]

20 Claims

1. A computer-implemented method of performing speech recognition, the method when executed on data processing hardware causes the data processing hardware to perform operations comprising:

receiving audio data indicating audio characteristics of an utterance;

providing, as input to an automatic speech recognition model, speech features determined based on the audio data, wherein the speech recognition model has been trained, using cluster adaptive training:

to recognize linguistic units for each of multiple different languages or dialects, with each of the multiple different languages or dialects corresponding to a separate cluster;

to receive, as input, different identifiers that specify the different clusters corresponding to the respective languages or dialects; and

to compute a weighted sum of the means of the different clusters, wherein the means of the different clusters are weighted based on the different identifiers;

based on the speech features provided as input to the speech recognition model, generating, as output from the speech recognition model at each of a plurality of time steps, an output vector at the corresponding time step indicating a probability distribution over a predetermined set of linguistic units for each of the multiple different languages or dialects the speech recognition model has been trained to recognize; and

providing, as an output of the automated speech recognition model, a transcription of the utterance generated based on the output vectors generated as output from the speech recognition model at each of the plurality of time steps.
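
For illustration only, the weighted sum of cluster means and the per-time-step output vectors recited in claim 1 can be sketched in Python. This is a toy sketch under stated assumptions, not the patented implementation: the function names (`interpolate_cluster_means`, `one_hot`, `softmax`, `greedy_transcribe`), the one-hot encoding of the language or dialect identifier, the example values, and the greedy (argmax) decoding step are all hypothetical and do not appear in the claim.

```python
import math


def one_hot(index, size):
    """Hypothetical identifier encoding: 1.0 for the selected cluster, else 0.0."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def interpolate_cluster_means(cluster_means, weights):
    """Weighted sum of per-cluster mean vectors (one weight per cluster)."""
    dim = len(cluster_means[0])
    return [
        sum(w * mean[i] for w, mean in zip(weights, cluster_means))
        for i in range(dim)
    ]


def softmax(logits):
    """Turn raw scores into a probability distribution over linguistic units."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_transcribe(output_vectors, units):
    """Assumed decoding rule: pick the most probable unit at each time step."""
    return "".join(
        units[max(range(len(p)), key=p.__getitem__)] for p in output_vectors
    )


if __name__ == "__main__":
    # Two toy clusters (for example, two dialects), each with a 3-dim mean.
    cluster_means = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

    # A one-hot identifier selects exactly one cluster's mean.
    print(interpolate_cluster_means(cluster_means, one_hot(1, 2)))

    # Non-one-hot weights blend the clusters.
    print(interpolate_cluster_means(cluster_means, [0.5, 0.5]))

    # One output vector per time step, then a transcription from those vectors.
    outputs = [softmax([2.0, 0.1, 0.1]), softmax([0.1, 0.1, 3.0])]
    print(greedy_transcribe(outputs, ["a", "b", "c"]))
```

With a one-hot identifier the weighted sum reduces to the mean of the selected cluster, while fractional weights interpolate between clusters. How the weights are derived from the identifiers in practice, and how the transcription is decoded from the output vectors, is not specified by this sketch.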