US 12,243,511 B1
Emphasizing portions of synthesized speech
Arnaud Vincent Pierre Yves Joly, Cambridge (GB); Marco Nicolis, London (GB); Elena Sergeevna Sokolova, London (GB); Jedrzej Sobanski, Gdansk (PL); Mateusz Aleksander Lajszczak, Cambridge (GB); Arent van Korlaar, London (GB); and Ruizhe Li, London (GB)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 31, 2022, as Appl. No. 17/709,788.
Int. Cl. G10L 13/10 (2013.01); G10L 13/033 (2013.01); G10L 13/04 (2013.01); G10L 13/06 (2013.01); G10L 15/26 (2006.01)
CPC G10L 13/10 (2013.01) [G10L 13/033 (2013.01); G10L 13/04 (2013.01); G10L 13/06 (2013.01); G10L 15/26 (2013.01)]
20 Claims
 
1. A computer-implemented method comprising:
receiving first input data representing natural language content for creation of synthesized speech;
receiving first speaker embedding data representing desired voice characteristics for the creation of synthesized speech;
performing grapheme-to-phoneme conversion using the first input data to determine first phoneme data representing phonemes of a first word of the first input data;
processing the first phoneme data and the first speaker embedding data using a first trained encoder to determine first phoneme embedding data representing the first phoneme data;
receiving a first indication that a first word in the first input data is to be emphasized;
determining first data by randomly or pseudorandomly sampling a first point from a first distribution of values in a latent space of a variational autoencoder;
processing the first indication, the first speaker embedding data, and the first data using a first trained decoder of the variational autoencoder to generate first acoustic embedding data corresponding to the first word;
combining the first phoneme embedding data and the first acoustic embedding data to determine modified phoneme embedding data; and
processing the modified phoneme embedding data using a second trained decoder to generate first audio data representing synthesized speech emphasizing the first word.
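
The data flow recited in claim 1 can be illustrated with a minimal, non-authoritative sketch. Everything below is an assumption made for illustration and is not taken from the patent: the dimensions, the function and variable names, the use of fixed random linear maps in place of the trained encoder and decoders, the use of element-wise addition as the "combining" step, and the treatment of the output as spectrogram-like frames standing in for audio data. Grapheme-to-phoneme conversion is assumed to have already produced integer phoneme IDs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for this sketch.
N_PHONEME_IDS = 10
PHONEME_DIM = 8
SPEAKER_DIM = 4
LATENT_DIM = 3
N_MELS = 5


def linear(in_dim, out_dim):
    """Stand-in for a trained network: a fixed random linear map with tanh."""
    w = rng.normal(size=(in_dim, out_dim))
    return lambda x: np.tanh(x @ w)


# Placeholders for the trained components named in the claim.
phoneme_table = rng.normal(size=(N_PHONEME_IDS, PHONEME_DIM))
phoneme_encoder = linear(PHONEME_DIM + SPEAKER_DIM, PHONEME_DIM)   # "first trained encoder"
vae_decoder = linear(1 + SPEAKER_DIM + LATENT_DIM, PHONEME_DIM)    # "first trained decoder"
audio_decoder = linear(PHONEME_DIM, N_MELS)                        # "second trained decoder"


def synthesize_word(phoneme_ids, emphasized, speaker, latent_mean, latent_std):
    """Run the claimed steps for one word and return stand-in audio frames."""
    n = len(phoneme_ids)
    speaker_per_phoneme = np.tile(speaker, (n, 1))

    # Encode phoneme data together with the speaker embedding.
    encoder_in = np.concatenate([phoneme_table[phoneme_ids], speaker_per_phoneme], axis=1)
    phoneme_emb = phoneme_encoder(encoder_in)            # shape (n, PHONEME_DIM)

    # Sample a point from a distribution in the VAE latent space
    # (a diagonal Gaussian is assumed here).
    z = rng.normal(latent_mean, latent_std)              # shape (LATENT_DIM,)

    # Decode the emphasis indication, speaker embedding, and sampled point
    # into a word-level acoustic embedding.
    flag = np.array([1.0 if emphasized else 0.0])
    acoustic_emb = vae_decoder(np.concatenate([flag, speaker, z]))  # shape (PHONEME_DIM,)

    # Combine: addition broadcast over the word's phonemes (an assumption).
    modified = phoneme_emb + acoustic_emb

    # Decode the modified phoneme embeddings into output frames.
    return audio_decoder(modified)                       # shape (n, N_MELS)


frames = synthesize_word(
    phoneme_ids=np.array([1, 2, 3]),
    emphasized=True,
    speaker=np.full(SPEAKER_DIM, 0.1),
    latent_mean=np.zeros(LATENT_DIM),
    latent_std=np.ones(LATENT_DIM),
)
```

Because the latent point is sampled anew on each call, repeated calls with the same inputs yield different acoustic embeddings, which is the role the sampling step plays in the claim. A real system would replace each placeholder map with a trained model and would pass the decoder output to a vocoder or equivalent stage.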