CPC G10L 13/047 (2013.01) [G10L 13/10 (2013.01)]
20 Claims
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
obtaining a variational embedding output from a reference encoder;
generating, using a text-to-speech (TTS) model comprising a waveform synthesizer, synthesized speech based on the variational embedding, target text, and a target speaker;
pairing the variational embedding with the target text and the target speaker; and
decomposing the variational embedding into a first hierarchical fraction and a second hierarchical fraction by:
computing the first hierarchical fraction decomposed from the variational embedding paired with the target text and target speaker; and
sampling, using the first hierarchical fraction, the second hierarchical fraction associated with the variational embedding.
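
For illustration only, and not as the claimed implementation, the decomposition steps recited above might be sketched in Python as follows. Every name here (PairedEmbedding, compute_first_fraction, sample_second_fraction, decompose) is hypothetical. The toy projection and the Gaussian sampling are assumptions standing in for whatever learned components an actual system would use; the claim does not specify them.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairedEmbedding:
    """A variational embedding paired with its target text and target speaker."""
    embedding: List[float]
    target_text: str
    target_speaker: str


def compute_first_fraction(paired: PairedEmbedding, dim: int) -> List[float]:
    """Compute the first hierarchical fraction from the paired embedding.

    Assumption: a fixed toy projection replaces a learned network. It is
    deterministic and depends on the embedding, the text, and the speaker.
    """
    if dim < 1 or len(paired.embedding) < dim:
        raise ValueError("embedding must have at least `dim` elements")
    # Toy conditioning scalar in [0, 1) derived from the text and the speaker.
    raw = len(paired.target_text) + sum(ord(c) for c in paired.target_speaker)
    cond = (raw % 7) / 7.0
    # Split the embedding into `dim` interleaved chunks and squash each mean.
    chunks = [paired.embedding[i::dim] for i in range(dim)]
    return [math.tanh(sum(c) / len(c) + cond) for c in chunks]


def sample_second_fraction(
    first: List[float], rng: random.Random, scale: float = 0.1
) -> List[float]:
    """Sample the second hierarchical fraction using the first fraction.

    Assumption: a Gaussian centred on each element of the first fraction.
    """
    return [rng.gauss(mu, scale) for mu in first]


def decompose(
    paired: PairedEmbedding, dim: int = 2, seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Decompose the paired embedding into (first, second) fractions."""
    rng = random.Random(seed)
    first = compute_first_fraction(paired, dim)
    second = sample_second_fraction(first, rng)
    return first, second


if __name__ == "__main__":
    paired = PairedEmbedding([0.1, -0.2, 0.3, 0.4], "hello", "spk1")
    first, second = decompose(paired)
    print("first fraction:", first)
    print("second fraction:", second)
```

In this sketch the first fraction is fully determined by the paired inputs, while the second fraction varies with the random seed. This mirrors the ordering in the claim: the first fraction is computed, and the second is then sampled using it.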