US 12,067,969 B2
Variational embedding capacity in expressive end-to-end speech synthesis
Eric Dean Battenberg, Sunnyvale, CA (US); Daisy Stanton, Mountain View, CA (US); Russell John Wyatt Skerry-Ryan, Mountain View, CA (US); Soroosh Mariooryad, Redwood City, CA (US); David Teh-Hwa Kao, San Francisco, CA (US); Thomas Edward Bagby, San Francisco, CA (US); and Sean Matthew Shannon, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Apr. 18, 2023, as Appl. No. 18/302,764.
Application 18/302,764 is a continuation of application No. 17/643,455, filed on Dec. 9, 2021, granted, now 11,646,010.
Application 17/643,455 is a continuation of application No. 16/879,714, filed on May 20, 2020, granted, now 11,222,621, issued on Jan. 11, 2022.
Claims priority of provisional application 62/851,879, filed on May 23, 2019.
Prior Publication US 2023/0260504 A1, Aug. 17, 2023
Int. Cl. G10L 13/00 (2006.01); G10L 13/047 (2013.01); G10L 13/10 (2013.01)
CPC G10L 13/047 (2013.01) [G10L 13/10 (2013.01)]
20 Claims
 
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
obtaining a variational embedding output from a reference encoder;
generating, using a text-to-speech (TTS) model comprising a waveform synthesizer, synthesized speech based on the variational embedding, target text, and a target speaker;
pairing the variational embedding with the target text and the target speaker; and
decomposing the variational embedding into a first hierarchical fraction and a second hierarchical fraction by:
computing the first hierarchical fraction decomposed from the variational embedding paired with the target text and target speaker; and
sampling, using the first hierarchical fraction, the second hierarchical fraction associated with the variational embedding.
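For illustration only, the decomposition recited in claim 1 can be sketched as two steps: a first fraction computed from the variational embedding paired with the target text and target speaker, then a second fraction sampled conditioned on the first. This is a minimal sketch under stated assumptions, not the claimed implementation. The function names, the pairing by concatenation, the linear projections `proj` and `cond`, and the Gaussian conditional are all hypothetical and do not come from the patent text.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(weights: Matrix, vec: Sequence[float]) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def compute_first_fraction(
    embedding: Sequence[float],
    text_feats: Sequence[float],
    speaker_feats: Sequence[float],
    proj: Matrix,
) -> Vector:
    """Compute the first fraction from the paired embedding.

    Assumption: "pairing" is modeled as concatenation, and the first
    fraction is a linear projection of the paired vector.
    """
    paired = list(embedding) + list(text_feats) + list(speaker_feats)
    return matvec(proj, paired)


def sample_second_fraction(
    first: Sequence[float],
    cond: Matrix,
    stddev: float,
    rng: random.Random,
) -> Vector:
    """Sample the second fraction using the first fraction.

    Assumption: the conditional is an isotropic Gaussian whose mean is
    a linear function of the first fraction.
    """
    mean = matvec(cond, first)
    return [rng.gauss(m, stddev) for m in mean]


def decompose(
    embedding: Sequence[float],
    text_feats: Sequence[float],
    speaker_feats: Sequence[float],
    proj: Matrix,
    cond: Matrix,
    stddev: float,
    rng: random.Random,
) -> Tuple[Vector, Vector]:
    """Return (first fraction, second fraction) for one embedding."""
    first = compute_first_fraction(embedding, text_feats, speaker_feats, proj)
    second = sample_second_fraction(first, cond, stddev, rng)
    return first, second
```

In this sketch the first fraction is deterministic given its inputs, while the second fraction varies from draw to draw unless `stddev` is zero or the random generator is seeded identically.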