US 11,990,117 B2
Using speech recognition to improve cross-language speech synthesis
Zhehuai Chen, Jersey City, NJ (US); Bhuvana Ramabhadran, Mt. Kisco, NY (US); Andrew Rosenberg, Brooklyn, NY (US); Yu Zhang, Mountain View, CA (US); and Pedro J. Moreno Mengibar, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 20, 2021, as Appl. No. 17/451,613.
Claims priority of provisional application 63/094,786, filed on Oct. 21, 2020.
Prior Publication US 2022/0122581 A1, Apr. 21, 2022
Int. Cl. G10L 13/047 (2013.01); G10L 13/08 (2013.01); G10L 13/10 (2013.01)
CPC G10L 13/047 (2013.01) [G10L 13/086 (2013.01); G10L 13/10 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations, the operations comprising:
obtaining a multilingual text-to-speech (TTS) model comprising:
an encoder portion that shares language embeddings across a first language and a different second language; and
a decoder portion that shares the language embeddings across the first and second languages and shares speaker embeddings for both native speakers of the first language and native speakers of the second language, wherein a number of speaker embeddings for the native speakers of the first language is less than a number of speaker embeddings for the native speakers of the second language;
generating, using the multilingual TTS model, a native synthesized speech representation for an input text sequence in the first language, the native synthesized speech representation conditioned on speaker characteristics of a native speaker of the first language;
generating, using the multilingual TTS model, a cross-lingual synthesized speech representation for the input text sequence in the first language, the cross-lingual synthesized speech representation conditioned on speaker characteristics of a native speaker of the second language;
generating, using a speech recognition model, a first speech recognition result for the native synthesized speech representation, the speech recognition model comprising a neural network trained to generate transcriptions of audio data;
generating, using the speech recognition model, a second speech recognition result for the cross-lingual synthesized speech representation;
determining a consistent loss term based on the first speech recognition result and the second speech recognition result; and
updating, using machine learning, parameters of the speech recognition model based on the consistent loss term.
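
To make the claimed model structure concrete, the following minimal PyTorch sketch shows one way the shared embedding tables could be arranged. The class name MultilingualTTS, the GRU placeholders, and all sizes are assumptions made for illustration, not the patented implementation; only the sharing pattern follows the claim: one language-embedding table used by both the encoder and decoder portions, and one speaker-embedding table shared by the decoder, in which first-language speakers occupy fewer rows than second-language speakers.

import torch.nn as nn

class MultilingualTTS(nn.Module):
    # Sketch of the claimed model structure; layer choices and sizes are
    # illustrative assumptions, not the patented implementation.
    def __init__(self, num_l1_speakers=10, num_l2_speakers=100, dim=256):
        super().__init__()
        # Language embeddings shared across the first and second languages.
        self.language_emb = nn.Embedding(2, dim)
        # Speaker embeddings shared across native speakers of both languages;
        # the first language contributes fewer rows, per the claim's
        # "less than" limitation.
        self.speaker_emb = nn.Embedding(num_l1_speakers + num_l2_speakers, dim)
        self.text_emb = nn.Embedding(1000, dim)            # toy text vocabulary
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # placeholder encoder
        self.decoder = nn.GRU(dim, dim, batch_first=True)  # placeholder decoder

    def forward(self, text_ids, language_id, speaker_id):
        # Encoder portion: conditioned on the shared language embedding.
        lang = self.language_emb(language_id).unsqueeze(1)  # (B, 1, dim)
        enc, _ = self.encoder(self.text_emb(text_ids) + lang)
        # Decoder portion: conditioned on the same language embedding and
        # on the shared speaker embedding.
        spk = self.speaker_emb(speaker_id).unsqueeze(1)     # (B, 1, dim)
        dec, _ = self.decoder(enc + lang + spk)
        return dec  # synthesized speech representation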
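
The remaining claimed operations (generating the two synthesized representations, recognizing both, and updating the recognizer from the consistent loss term) can be sketched as a single training step. Everything here beyond the claim's recited steps is an assumption: the function and argument names, the frozen TTS model, the treatment of the native recognition result as the target, and the use of KL divergence as the consistency measure (the claim only requires a loss term based on both recognition results).

import torch
import torch.nn.functional as F

def consistency_training_step(tts_model, asr_model, optimizer,
                              text_ids, language_id,
                              native_speaker_id, cross_speaker_id):
    # Native synthesized speech representation: first-language text voiced
    # with the characteristics of a first-language speaker. The TTS model
    # is frozen here; only the speech recognition model is updated.
    with torch.no_grad():
        native_speech = tts_model(text_ids, language_id, native_speaker_id)
        # Cross-lingual representation: the same first-language text voiced
        # with the characteristics of a second-language speaker.
        cross_speech = tts_model(text_ids, language_id, cross_speaker_id)

    # First and second speech recognition results; here each result is a
    # per-frame distribution over output tokens, and the sketch assumes the
    # two outputs are time-aligned, which a real system must arrange for.
    native_logits = asr_model(native_speech)
    cross_logits = asr_model(cross_speech)

    # Consistent loss term based on both results: penalize disagreement
    # between the cross-lingual recognition result and the (detached)
    # native recognition result.
    consistent_loss = F.kl_div(
        F.log_softmax(cross_logits, dim=-1),
        F.softmax(native_logits, dim=-1).detach(),
        reduction="batchmean",
    )

    # Update the speech recognition model's parameters from the loss.
    optimizer.zero_grad()
    consistent_loss.backward()
    optimizer.step()
    return consistent_loss.item()

In use, optimizer would wrap only the recognizer's parameters, e.g. torch.optim.Adam(asr_model.parameters()), so that the final step updates the speech recognition model and nothing else, as the claim recites.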