CPC G10L 13/033 (2013.01) [G10L 13/047 (2013.01); G10L 13/10 (2013.01)] | 20 Claims |
1. A computer-implemented method, comprising:
receiving, from a first device, a first user input representing a natural language description of a desired synthetic voice;
processing, using a first encoder, the first user input to determine synthetic voice description embedding data representing the natural language description of the desired synthetic voice;
determining, using the synthetic voice description embedding data, first synthetic voice embedding data corresponding to a first proposed synthetic voice;
processing, using a decoder, the first synthetic voice embedding data to determine first synthetic voice characteristics data;
generating, using the first synthetic voice characteristics data and text data representing words, first output audio data representing first synthetic speech corresponding to the first proposed synthetic voice saying the words;
causing the first device to output the first output audio data;
receiving a second user input representing a user satisfaction corresponding to the first proposed synthetic voice;
based at least in part on the user satisfaction and the first synthetic voice embedding data, generating first data representing a first probability that second synthetic voice embedding data corresponding to a second proposed synthetic voice will result in higher user satisfaction than third synthetic voice embedding data corresponding to a third proposed synthetic voice;
based at least in part on the first data, selecting the second synthetic voice embedding data instead of the third synthetic voice embedding data;
processing, using the decoder, the second synthetic voice embedding data to determine second synthetic voice characteristics data;
generating, using the second synthetic voice characteristics data and the text data, second output audio data representing second synthetic speech corresponding to the second proposed synthetic voice saying the words; and
causing the first device to output the second output audio data.
|
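The feedback-driven selection recited in claim 1 (generating a probability that the second synthetic voice embedding data will result in higher user satisfaction than the third, then selecting on that basis) can be illustrated with a minimal Python sketch. The claim does not specify how the probability is computed. The cosine-similarity score, the logistic comparison, the satisfaction range of [-1, 1], and the `scale` parameter below are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predicted_score(candidate, rated_embedding, satisfaction):
    """Assumed scoring rule, not taken from the claim.

    satisfaction is assumed to lie in [-1, 1]. A positive value rewards
    candidates similar to the voice the user just rated; a negative value
    rewards candidates that differ from it.
    """
    return satisfaction * cosine(candidate, rated_embedding)


def preference_probability(second, third, rated_embedding, satisfaction, scale=4.0):
    """Probability that `second` yields higher satisfaction than `third`.

    Assumption: a logistic function of the score difference. `scale` is a
    hypothetical sharpness parameter.
    """
    diff = (predicted_score(second, rated_embedding, satisfaction)
            - predicted_score(third, rated_embedding, satisfaction))
    return 1.0 / (1.0 + math.exp(-scale * diff))


def select_next(second, third, rated_embedding, satisfaction):
    """Return the chosen candidate embedding together with the probability used."""
    p = preference_probability(second, third, rated_embedding, satisfaction)
    return (second, p) if p >= 0.5 else (third, p)


if __name__ == "__main__":
    first_voice = [1.0, 0.0]   # embedding of the first proposed voice (toy values)
    candidate_b = [1.0, 0.0]   # "second" candidate embedding
    candidate_c = [0.0, 1.0]   # "third" candidate embedding
    chosen, p = select_next(candidate_b, candidate_c, first_voice, satisfaction=1.0)
    print(chosen, round(p, 3))
```

In this toy setup, positive feedback on the first voice favors the candidate closest to it, and negative feedback favors the one further away. A real system would then pass the selected embedding to the decoder and speech generator recited in the later steps of the claim.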