US 12,260,851 B2
Two-level text-to-speech systems using synthetic training data
Lev Finkelstein, Mountain View, CA (US); Chun-an Chan, Mountain View, CA (US); Byungha Chun, Tokyo (JP); Norman Casagrande, London (GB); Yu Zhang, Mountain View, CA (US); Robert Andrew James Clark, Hertfordshire (GB); and Vincent Wan, London (GB)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jul. 14, 2021, as Appl. No. 17/305,809.
Prior Publication US 2023/0018384 A1, Jan. 19, 2023
Int. Cl. G10L 13/00 (2006.01); G10L 13/047 (2013.01); G10L 13/08 (2013.01)
CPC G10L 13/08 (2013.01) [G10L 13/047 (2013.01)] 27 Claims
OG exemplary drawing
 
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
obtaining training data including a plurality of training audio signals and corresponding transcripts, each training audio signal corresponding to a reference utterance spoken by a target speaker in a first accent/dialect, each transcript comprising a textual representation of the corresponding reference utterance;
for each training audio signal of the training data:
generating, by a trained voice cloning system trained to generate synthesized speech that clones a voice of the target speaker in a second accent/dialect different than the first accent/dialect and configured to receive the training audio signal corresponding to the reference utterance spoken by the target speaker in the first accent/dialect as input, a training synthesized speech representation of the corresponding reference utterance spoken by the target speaker in the first accent/dialect, the training synthesized speech representation comprising an output audio waveform of synthesized speech that clones the voice of the target speaker in the second accent/dialect different than the first accent/dialect;
outputting, from the trained voice cloning system, the training synthesized speech representation comprising the output audio waveform of synthesized speech that clones the voice of the target speaker in the second accent/dialect different than the first accent/dialect;
obtaining a text-to-speech (TTS) system different than the trained voice cloning system, the TTS system not trained to generate synthesized speech that clones the voice of the target speaker in the second accent/dialect; and
training the TTS system to learn to generate synthesized speech that clones the voice of the target speaker in the second accent/dialect based on the corresponding transcript of the training audio signal and the training synthesized speech representation of the corresponding reference utterance output from the trained voice cloning system;
receiving an input text utterance to be synthesized into speech in the second accent/dialect;
obtaining conditioning inputs comprising a speaker embedding representing voice characteristics of the target speaker and an accent/dialect identifier identifying the second accent/dialect; and
generating, using the trained TTS system conditioned on the obtained conditioning inputs, by processing the input text utterance, an output audio waveform corresponding to a synthesized speech representation of the input text utterance that clones the voice of the target speaker in the second accent/dialect.
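The data flow recited in claim 1 can be sketched in code as follows. This is a hypothetical illustration only: all class and function names are stand-ins, and the claim specifies the two-level arrangement (a trained voice cloning system producing synthetic training audio, then a separate TTS system trained on transcript/synthetic-audio pairs), not any particular model architecture.

```python
# Hypothetical sketch of the two-level pipeline in claim 1.
# Level 1: a trained voice cloning system converts each reference utterance
#   (target speaker, first accent/dialect) into synthesized speech cloning
#   the same voice in the second accent/dialect.
# Level 2: a separate TTS system is trained on (transcript, synthetic audio)
#   pairs, then conditioned at inference on a speaker embedding and an
#   accent/dialect identifier.

from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    audio: list      # reference utterance by the target speaker, first accent
    transcript: str  # textual representation of that utterance


class VoiceCloningSystem:
    """Level 1: already trained to clone the target voice in the second accent."""

    def synthesize(self, audio: list) -> list:
        # Placeholder: would return an output audio waveform of synthesized
        # speech cloning the target speaker's voice in the second accent.
        return list(audio)


class TTSSystem:
    """Level 2: initially not trained to clone the target voice in the second accent."""

    def __init__(self) -> None:
        self.trained_pairs = []

    def train_step(self, transcript: str, synthetic_audio: list) -> None:
        # Train on the transcript and the synthetic speech representation
        # output by the voice cloning system.
        self.trained_pairs.append((transcript, synthetic_audio))

    def synthesize(self, text: str, speaker_embedding: list, accent_id: str) -> list:
        # Inference conditioned on the speaker embedding and accent identifier;
        # placeholder waveform stands in for real synthesis.
        return [hash((text, tuple(speaker_embedding), accent_id)) % 100]


def train_two_level(training_data: List[TrainingExample]) -> TTSSystem:
    cloner = VoiceCloningSystem()  # level 1, pre-trained
    tts = TTSSystem()              # level 2, to be trained
    for ex in training_data:
        synthetic = cloner.synthesize(ex.audio)    # second-accent clone of the voice
        tts.train_step(ex.transcript, synthetic)   # (transcript, synthetic audio) pair
    return tts
```

In this sketch the TTS system never sees the original first-accent recordings during training, only the transcripts and the synthetic second-accent audio, which mirrors the claim's separation between the two systems.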