US 11,942,076 B2
Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
Ke Hu, Mountain View, CA (US); Golan Pundak, New York, NY (US); Rohit Prakash Prabhavalkar, Santa Clara, CA (US); Antoine Jean Bruguier, Milpitas, CA (US); and Tara N. Sainath, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Feb. 16, 2022, as Appl. No. 17/651,315.
Application 17/651,315 is a continuation of application No. 16/861,190, filed on Apr. 28, 2020, granted, now 11,270,687.
Claims priority of provisional application 62/842,571, filed on May 3, 2019.
Prior Publication US 2022/0172706 A1, Jun. 2, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/30 (2013.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/187 (2013.01); G10L 15/193 (2013.01); G10L 15/28 (2013.01); G10L 15/32 (2013.01); G10L 25/30 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/02 (2013.01); G10L 15/187 (2013.01); G10L 15/193 (2013.01); G10L 15/285 (2013.01); G10L 15/32 (2013.01); G10L 25/30 (2013.01); G10L 2015/025 (2013.01)]
20 Claims
1. A computer-implemented method that, when executed by data processing hardware, causes the data processing hardware to perform operations comprising:
receiving audio data corresponding to an utterance, the utterance comprising at least one word in a first language and a particular word in a second language different than the first language;
receiving a biasing term list comprising the particular word in the second language;
processing, using a speech recognition model, the audio data to generate a plurality of speech recognition hypotheses for the utterance, each speech recognition hypothesis of the plurality of speech recognition hypotheses comprising a corresponding phoneme sequence in the first language and a corresponding speech recognition score;
rescoring, using one or more terms in the second language from the biasing term list, the corresponding speech recognition scores for the corresponding phoneme sequences in the first language generated by the speech recognition model; and
using the rescored speech recognition scores for the corresponding phoneme sequences in the first language, executing a decoding graph to output a transcription of the utterance by biasing the transcription to favor inclusion of the particular word in the second language from the biasing term list.
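The rescoring and decoding steps recited above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes that each biasing term has already been mapped to a phoneme sequence in the first language, that rescoring adds a fixed bonus to any hypothesis containing that sequence, and that decoding is a greedy longest-match lookup in a small pronunciation lexicon rather than a real decoding graph. All names (Hypothesis, rescore, decode, bonus), the phoneme symbols, and the scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]


@dataclass(frozen=True)
class Hypothesis:
    phonemes: Phonemes  # phoneme sequence in the first language
    score: float        # log-probability-like score; higher is better


def contains(seq: Phonemes, sub: Phonemes) -> bool:
    """Return True if `sub` occurs as a contiguous run inside `seq`."""
    m = len(sub)
    return any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))


def rescore(hyps: List[Hypothesis],
            bias_pronunciations: Dict[str, Phonemes],
            bonus: float = 2.0) -> List[Hypothesis]:
    """Add `bonus` to a hypothesis for each biasing term whose
    first-language phoneme sequence it contains (assumed scheme)."""
    rescored = []
    for h in hyps:
        boost = sum(bonus for p in bias_pronunciations.values()
                    if contains(h.phonemes, p))
        rescored.append(Hypothesis(h.phonemes, h.score + boost))
    return rescored


def decode(hyps: List[Hypothesis], lexicon: Dict[Phonemes, str]) -> str:
    """Toy stand-in for a decoding graph: take the best-scoring hypothesis
    and map phonemes to words by greedy longest match in `lexicon`."""
    best = max(hyps, key=lambda h: h.score)
    words, i = [], 0
    while i < len(best.phonemes):
        for j in range(len(best.phonemes), i, -1):
            word = lexicon.get(best.phonemes[i:j])
            if word is not None:
                words.append(word)
                i = j
                break
        else:
            i += 1  # no word starts here; skip this phoneme
    return " ".join(words)


# Illustrative data only: an English utterance containing a French place name.
DIRECTIONS = ("d", "ih", "r", "eh", "k", "sh", "ah", "n", "z")
TO = ("t", "uw")
CRETEIL = ("k", "r", "eh", "t", "ey")
CRETAN = ("k", "r", "eh", "t", "ah", "n")

lexicon = {DIRECTIONS: "directions", TO: "to",
           CRETEIL: "Creteil", CRETAN: "cretan"}
bias = {"Creteil": CRETEIL}  # biasing term list, mapped to first-language phonemes

hyps = [
    Hypothesis(DIRECTIONS + TO + CRETAN, -1.0),   # preferred without biasing
    Hypothesis(DIRECTIONS + TO + CRETEIL, -2.5),  # contains the biasing term
]

print(decode(hyps, lexicon))
print(decode(rescore(hyps, bias), lexicon))
```

In this toy setup the unbiased decode selects the hypothesis ending in "cretan", while the rescored decode selects the one containing the biasing term. The additive bonus is one simple way to express "biasing the transcription to favor inclusion"; other scoring schemes are equally compatible with the claim language.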