| CPC G10L 13/10 (2013.01) [G10L 13/033 (2013.01); G10L 13/04 (2013.01); G10L 13/06 (2013.01); G10L 15/26 (2013.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
receiving first input data representing natural language content for creation of synthesized speech;
receiving first speaker embedding data representing desired voice characteristics for the creation of synthesized speech;
performing grapheme-to-phoneme conversion using the first input data to determine first phoneme data representing phonemes of a first word of the first input data;
processing the first phoneme data and the first speaker embedding data using a first trained encoder to determine first phoneme embedding data representing the first phoneme data;
receiving a first indication that the first word in the first input data is to be emphasized;
determining first data by randomly or pseudorandomly sampling a first point from a first distribution of values in a latent space of a variational autoencoder;
processing the first indication, the first speaker embedding data, and the first data using a first trained decoder of the variational autoencoder to generate first acoustic embedding data corresponding to the first word;
combining the first phoneme embedding data and the first acoustic embedding data to determine modified phoneme embedding data; and
processing the modified phoneme embedding data using a second trained decoder to generate first audio data representing synthesized speech emphasizing the first word.
|
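The data flow recited in claim 1 can be illustrated with a minimal Python sketch. This is only a toy under stated assumptions, not the patented implementation: every function name, the `TOY_LEXICON` table, the embedding width `DIM`, the standard-normal latent distribution, and the additive combination step are hypothetical stand-ins, and the "trained" encoder and decoders are replaced by fixed arithmetic so the example runs without any model weights.

```python
import math
import random

DIM = 4  # toy embedding width (assumption, not from the claim)

# Hypothetical mini-lexicon standing in for a real grapheme-to-phoneme model.
TOY_LEXICON = {"really": ["R", "IH", "L", "IY"], "good": ["G", "UH", "D"]}


def g2p(word):
    """Grapheme-to-phoneme conversion by lookup; unknown words fall back to letters."""
    return TOY_LEXICON.get(word.lower(), list(word.upper()))


def phoneme_encoder(phonemes, speaker):
    """Stand-in for the first trained encoder: one DIM-vector per phoneme,
    conditioned on the speaker embedding."""
    out = []
    for p in phonemes:
        code = sum(ord(c) for c in p) % 17 / 17.0
        out.append([code + s for s in speaker])
    return out


def sample_latent(rng, mean, std):
    """Pseudorandomly sample one point from a diagonal Gaussian in latent space."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def vae_decoder(emphasize, speaker, z):
    """Stand-in for the VAE decoder: maps the emphasis indication, the speaker
    embedding, and the latent sample to a word-level acoustic embedding."""
    flag = 1.0 if emphasize else 0.0
    return [flag * math.tanh(zi + si) for zi, si in zip(z, speaker)]


def combine(phoneme_emb, acoustic_emb):
    """Add the acoustic embedding to every phoneme embedding of the word
    (addition is an assumption; the claim only says 'combining')."""
    return [[p + a for p, a in zip(vec, acoustic_emb)] for vec in phoneme_emb]


def audio_decoder(modified_emb, samples_per_phoneme=8):
    """Stand-in for the second trained decoder: a few waveform samples per phoneme."""
    audio = []
    for vec in modified_emb:
        amp = sum(vec) / len(vec)
        audio.extend(amp * math.sin(2 * math.pi * n / samples_per_phoneme)
                     for n in range(samples_per_phoneme))
    return audio


def synthesize(word, speaker, emphasize, seed=0):
    """Run the claimed steps in order for a single word."""
    rng = random.Random(seed)
    phonemes = g2p(word)
    phoneme_emb = phoneme_encoder(phonemes, speaker)
    z = sample_latent(rng, [0.0] * DIM, [1.0] * DIM)
    acoustic_emb = vae_decoder(emphasize, speaker, z)
    modified = combine(phoneme_emb, acoustic_emb)
    return phonemes, audio_decoder(modified)


speaker = [0.1, 0.2, 0.3, 0.4]  # hypothetical speaker embedding
phonemes, audio = synthesize("really", speaker, emphasize=True, seed=1)
```

Seeding the generator makes the pseudorandom latent sample reproducible, so two calls with the same seed return the same toy waveform, while changing the seed or clearing the emphasis flag changes the acoustic embedding and therefore the output.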