US 12,087,270 B1
User-customized synthetic voice
Sebastian Dariusz Cygert, Gdansk (PL); Daniel Korzekwa, Gdansk (PL); Kamil Pokora, Gdansk (PL); Piotr Tadeusz Bilinski, Warsaw (PL); Kayoko Yanagisawa, Cambridge (GB); Abdelhamid Ezzerg, Cambridge (GB); Thomas Edward Merritt, Downham Market (GB); Raghu Ram Sreepada Srinivas, Snohomish, WA (US); and Nikhil Sharma, Kirkland, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 29, 2022, as Appl. No. 17/955,961.
Int. Cl. G10L 15/16 (2006.01); G10L 13/033 (2013.01); G10L 13/047 (2013.01); G10L 13/10 (2013.01); G10L 15/06 (2013.01); G10L 25/30 (2013.01)
CPC G10L 13/033 (2013.01) [G10L 13/047 (2013.01); G10L 13/10 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method, comprising:
receiving, from a first device, a first user input representing a natural language description of a desired synthetic voice;
processing, using a first encoder, the first user input to determine synthetic voice description embedding data representing the natural language description of the desired synthetic voice;
determining, using the synthetic voice description embedding data, first synthetic voice embedding data corresponding to a first proposed synthetic voice;
processing, using a decoder, the first synthetic voice embedding data to determine first synthetic voice characteristics data;
generating, using the first synthetic voice characteristics data and text data representing words, first output audio data representing first synthetic speech corresponding to the first proposed synthetic voice saying the words;
causing the first device to output the first output audio data;
receiving a second user input representing a user satisfaction corresponding to the first proposed synthetic voice;
based at least in part on the user satisfaction and the first synthetic voice embedding data, generating first data representing a first probability that second synthetic voice embedding data corresponding to a second proposed synthetic voice will result in higher user satisfaction than third synthetic voice embedding data corresponding to a third proposed synthetic voice;
based at least in part on the first data, selecting the second synthetic voice embedding data instead of the third synthetic voice embedding data;
processing, using the decoder, the second synthetic voice embedding data to determine second synthetic voice characteristics data;
generating, using the second synthetic voice characteristics data and the text data, second output audio data representing second synthetic speech corresponding to the second proposed synthetic voice saying the words; and
causing the first device to output the second output audio data.
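
The claim above traces a pipeline from a natural language voice description to audible synthetic speech: a first encoder turns the description into an embedding, that embedding is used to select a synthetic voice embedding, a decoder expands the voice embedding into voice characteristics, and a text-to-speech back end renders the text with those characteristics. The Python sketch below illustrates one plausible arrangement of those stages; every module and function name (DescriptionEncoder, VoiceDecoder, nearest_voice_embedding, synthesize) is hypothetical and not taken from the patent.

# Illustrative sketch of the claim-1 pipeline (all names hypothetical):
# natural-language description -> description embedding -> voice embedding
# -> voice characteristics -> synthetic speech audio.
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    """Maps a tokenized voice description to a fixed-size embedding (the "first encoder")."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        _, hidden = self.rnn(self.embed(token_ids))
        return hidden[-1]  # synthetic voice description embedding

class VoiceDecoder(nn.Module):
    """Decodes a synthetic voice embedding into voice characteristics data."""
    def __init__(self, dim=64, n_characteristics=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_characteristics))

    def forward(self, voice_embedding):
        return self.net(voice_embedding)

def nearest_voice_embedding(description_embedding, candidate_voices):
    """Pick the candidate voice embedding closest to the description embedding."""
    sims = torch.nn.functional.cosine_similarity(description_embedding, candidate_voices)
    return candidate_voices[sims.argmax()]

def synthesize(voice_characteristics, text):
    """Stand-in for a TTS back end that would render `text` with the given characteristics."""
    # A real system would run an acoustic model and vocoder here; we return a dummy waveform.
    return torch.zeros(16000), f"<speech: '{text}' with {voice_characteristics.shape[-1]} traits>"

# Usage: propose a first synthetic voice from the user's description.
encoder, decoder = DescriptionEncoder(), VoiceDecoder()
description_tokens = torch.randint(0, 1000, (1, 12))   # e.g. "a warm, low-pitched narrator voice"
candidate_voices = torch.randn(50, 64)                 # catalog of synthetic voice embeddings
desc_emb = encoder(description_tokens)
first_voice_emb = nearest_voice_embedding(desc_emb, candidate_voices)
first_characteristics = decoder(first_voice_emb)
audio, info = synthesize(first_characteristics, "Hello, this is your new voice.")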
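
Claim 1's feedback step generates "first data" representing the probability that one candidate voice embedding will yield higher user satisfaction than another, then selects the favored candidate. The sketch below models that comparison as a Bradley-Terry-style probability over predicted satisfaction scores derived from the user's rating of the first proposed voice; the satisfaction model and the pairwise form are illustrative assumptions, since the claim does not specify how the probability is computed.

# Sketch of the feedback-driven selection step: from the user's satisfaction with the
# first proposed voice, estimate the probability that candidate A will be preferred
# over candidate B, then keep the more promising embedding. The satisfaction predictor
# and Bradley-Terry comparison are assumptions, not the patent's stated method.
import torch

def predicted_satisfaction(voice_emb, rated_emb, rating):
    """Toy satisfaction model: candidates similar to a well-rated voice score higher,
    candidates similar to a poorly-rated voice score lower."""
    sim = torch.nn.functional.cosine_similarity(voice_emb, rated_emb, dim=0)
    return rating * sim

def preference_probability(emb_a, emb_b, rated_emb, rating):
    """Probability that voice A yields higher user satisfaction than voice B
    (Bradley-Terry form over the predicted satisfaction scores)."""
    s_a = predicted_satisfaction(emb_a, rated_emb, rating)
    s_b = predicted_satisfaction(emb_b, rated_emb, rating)
    return torch.sigmoid(s_a - s_b)  # the "first data" of the claim

# Usage: the user rated the first proposed voice 0.8 (fairly satisfied).
first_voice_emb = torch.randn(64)
second_voice_emb = torch.randn(64)
third_voice_emb = torch.randn(64)
p = preference_probability(second_voice_emb, third_voice_emb, first_voice_emb, rating=0.8)
chosen = second_voice_emb if p > 0.5 else third_voice_emb  # select the favored embedding
# `chosen` would then be decoded to voice characteristics and rendered as before.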