US 11,837,216 B2
Speech recognition using unspoken text and speech synthesis
Zhehuai Chen, Jersey City, NJ (US); Andrew M. Rosenberg, Brooklyn, NY (US); Bhuvana Ramabhadran, Mt. Kisco, NY (US); and Pedro J. Moreno Mengibar, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Feb. 14, 2023, as Appl. No. 18/168,969.
Application 18/168,969 is a continuation of application No. 17/454,536, filed on Nov. 11, 2021, granted, now 11,605,368.
Application 17/454,536 is a continuation of application No. 16/869,552, filed on May 7, 2020, granted, now 11,222,620, issued on Jan. 11, 2022.
Prior Publication US 2023/0197057 A1, Jun. 22, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 13/00 (2006.01); G10L 13/08 (2013.01); G10L 15/06 (2013.01)
CPC G10L 13/00 (2013.01) [G10L 13/08 (2013.01); G10L 15/063 (2013.01)]
20 Claims
 
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
obtaining a spoken training utterance comprising a corresponding transcription paired with a corresponding non-synthetic speech representation of the spoken training utterance;
obtaining an embedding representing speaker characteristics of a speaker that spoke the spoken training utterance;
conditioning the corresponding transcription of the spoken training utterance on the embedding representing the speaker characteristics of the speaker that spoke the spoken training utterance;
generating, as output from a text-to-speech (TTS) model configured to receive the corresponding transcription of the spoken training utterance as input, a synthetic speech representation of the spoken training utterance conditioned on the embedding; and
training a speech recognition model on the non-synthetic speech representation of the spoken training utterance and the synthetic speech representation generated as output from the TTS model.
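The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `SpokenUtterance`, `condition_on_speaker`, `toy_tts`, `build_training_batch`, and `ToyRecognizer` are hypothetical, the "TTS model" is a toy function that maps characters to numbers, and the "speech recognition model" only counts the examples it is given. The sketch shows only the order of the steps: obtain the paired utterance and speaker embedding, condition the transcription on the embedding, generate a synthetic speech representation, and train on both the non-synthetic and the synthetic representations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Features = List[float]
Conditioned = Tuple[str, Tuple[float, ...]]


@dataclass
class SpokenUtterance:
    """A transcription paired with its non-synthetic speech representation."""
    transcription: str
    speech: Features             # non-synthetic speech representation
    speaker_embedding: Features  # embedding of the speaker's characteristics


def condition_on_speaker(transcription: str, embedding: Features) -> Conditioned:
    """Conditioning step (assumed form): pair the text with the embedding."""
    return (transcription, tuple(embedding))


def toy_tts(conditioned: Conditioned) -> Features:
    """Hypothetical stand-in for a TTS model: one value per character,
    shifted by the mean of the speaker embedding."""
    text, embedding = conditioned
    bias = sum(embedding) / len(embedding) if embedding else 0.0
    return [ord(ch) / 128.0 + bias for ch in text]


def build_training_batch(
    utterance: SpokenUtterance,
    tts: Callable[[Conditioned], Features],
) -> List[Tuple[Features, str]]:
    """Return (speech representation, transcription) pairs for one utterance:
    the non-synthetic one first, then the synthetic one from the TTS model."""
    conditioned = condition_on_speaker(
        utterance.transcription, utterance.speaker_embedding
    )
    synthetic = tts(conditioned)
    return [
        (utterance.speech, utterance.transcription),
        (synthetic, utterance.transcription),
    ]


class ToyRecognizer:
    """Placeholder for a speech recognition model; it only counts examples."""

    def __init__(self) -> None:
        self.examples_seen = 0

    def train_step(self, batch: List[Tuple[Features, str]]) -> None:
        self.examples_seen += len(batch)


if __name__ == "__main__":
    utt = SpokenUtterance("hi", speech=[0.1, 0.2], speaker_embedding=[0.5, 0.5])
    recognizer = ToyRecognizer()
    recognizer.train_step(build_training_batch(utt, toy_tts))
    print(recognizer.examples_seen)
```

In this sketch both training examples share the same transcription, so the recognizer sees the real recording and a synthetic rendering of the same text in the same (toy) speaker condition. A real system would replace `toy_tts` and `ToyRecognizer` with trained models and an actual loss.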