US 12,087,273 B2
Multilingual speech synthesis and cross-language voice cloning
Yu Zhang, Mountain View, CA (US); Ron J. Weiss, New York, NY (US); Byungha Chun, Tokyo (JP); Yonghui Wu, Fremont, CA (US); Zhifeng Chen, Sunnyvale, CA (US); Russell John Wyatt Skerry-Ryan, Mountain View, CA (US); Ye Jia, Mountain View, CA (US); Andrew M. Rosenberg, Brooklyn, NY (US); and Bhuvana Ramabhadran, Mt. Kisco, NY (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 30, 2023, as Appl. No. 18/161,217.
Application 18/161,217 is a continuation of application No. 16/855,042, filed on Apr. 22, 2020, granted, now 11,580,952.
Claims priority of provisional application 62/855,067, filed on May 31, 2019.
Prior Publication US 2023/0178068 A1, Jun. 8, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 21/00 (2013.01); G10L 13/00 (2006.01); G10L 13/047 (2013.01)

CPC G10L 13/047 (2013.01)

22 Claims

1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising:

receiving an input text sequence in a first language;

obtaining a speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker; and

processing, using a multilingual text-to-speech (TTS) model configured to receive the speaker embedding and the input text sequence in the first language as input, the speaker embedding and the input text sequence in the first language to generate an output audio feature representation as output from the multilingual TTS model, the output audio feature representation representing synthesized speech that clones the voice of the target speaker in a second language different than the first language.
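
The three operations recited in claim 1 (receive text, obtain a speaker embedding, process both with a TTS model to produce audio features) can be illustrated with a minimal data-flow sketch. Everything named here is hypothetical and is not taken from the patent: `ToyMultilingualTTS`, `N_MELS`, `frames_per_symbol`, and the character-based arithmetic are placeholders. The sketch does not model languages or a trained network; it only shows that the output features depend on both the input text and the speaker embedding.

```python
from dataclasses import dataclass
from typing import List, Sequence

N_MELS = 4  # hypothetical number of audio feature channels per frame


@dataclass(frozen=True)
class ToyMultilingualTTS:
    """Hypothetical stand-in for a multilingual TTS model (not the claimed network)."""

    frames_per_symbol: int = 2  # assumed fixed duration per input symbol

    def synthesize(
        self, text: str, speaker_embedding: Sequence[float]
    ) -> List[List[float]]:
        """Map (text, speaker embedding) to a list of audio feature frames."""
        if not text:
            raise ValueError("input text sequence must be non-empty")
        if len(speaker_embedding) != N_MELS:
            raise ValueError(f"speaker embedding must have {N_MELS} values")

        frames: List[List[float]] = []
        for ch in text:
            # Toy text encoding: reduce each character to a scalar in [0, 1).
            content = (ord(ch) % 97) / 97.0
            for _ in range(self.frames_per_symbol):
                # Condition every frame on the speaker embedding by simple addition.
                frames.append([content + s for s in speaker_embedding])
        return frames


# Step 1: receive an input text sequence.
input_text = "hola"
# Step 2: obtain a speaker embedding for the target speaker (made-up values).
speaker = [0.1, -0.2, 0.3, 0.0]
# Step 3: process both inputs to generate the output audio feature representation.
model = ToyMultilingualTTS()
features = model.synthesize(input_text, speaker)
print(len(features), len(features[0]))
```

In a real system the frames would typically be passed to a vocoder to produce a waveform; that stage is outside claim 1 and is omitted here.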