CPC G10L 13/02 (2013.01) [G06N 3/088 (2013.01); G10L 13/00 (2013.01); G10L 13/10 (2013.01); G10L 25/30 (2013.01); G10L 25/63 (2013.01); G10L 2013/105 (2013.01)] | 18 Claims |
1. A computing system, comprising:
a processor; and
memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising:
obtaining computer-readable text comprising words;
providing the computer-readable text as input to an emotional classifier model that has been trained based upon a plurality of words having emotional labels assigned thereto, wherein the emotional labels identify respective emotions assigned to the plurality of words;
obtaining a first textual embedding of the computer-readable text as output of the emotional classifier model, wherein the first textual embedding represents semantics of the words;
generating a phoneme sequence based upon the words of the computer-readable text;
generating a phoneme encoding based upon the phoneme sequence;
providing the first textual embedding and the phoneme encoding as input to a text-to-speech (TTS) model, wherein the TTS model is trained using text-waveform pairs without emotional labels, wherein the TTS model generates a first output for the words that is indicative of an emotion and a first emotional intensity level to be expressed when the words are audibly output;
receiving a value that is indicative of a second emotional intensity level;
replacing the first textual embedding with a second textual embedding based upon the value;
providing the second textual embedding as input to the TTS model in place of the first textual embedding, wherein the TTS model generates a second output for the words that is indicative of the second emotional intensity level of the words when the words are audibly output; and
causing speech that includes the words to be played over a speaker based upon the second output of the TTS model, wherein the speech expresses the emotion and the second emotional intensity level.
|
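The acts recited in claim 1 can be read as a data-flow pipeline. The following Python sketch is illustrative only and is not the claimed implementation: every function and class name in it (`emotion_embedding`, `to_phonemes`, `encode_phonemes`, `toy_tts`, `rescale_embedding`, `run_pipeline`, `TTSOutput`) is hypothetical, the classifier and TTS model are replaced by toy stand-ins, and the second embedding is derived by simply scaling the first embedding by the received value. The claim does not state how the second embedding is computed from the value, so that scaling step is an assumption. Playback over a speaker is omitted.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]

# Hypothetical toy lexicon standing in for a classifier trained on emotion-labelled words.
# Index 0 = "joy", index 1 = "sadness" (labels chosen for illustration only).
EMOTION_WORDS = {"happy": 0, "great": 0, "sad": 1, "terrible": 1}


def emotion_embedding(text: str) -> Vector:
    """Toy stand-in for the emotional classifier: counts emotion-bearing words."""
    vec = [0.0, 0.0]
    for word in text.lower().split():
        idx = EMOTION_WORDS.get(word.strip(".,!?"))
        if idx is not None:
            vec[idx] += 1.0
    return vec


def to_phonemes(text: str) -> List[str]:
    """Toy grapheme-to-phoneme step: treats each ASCII letter as one phoneme."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]


def encode_phonemes(phonemes: List[str]) -> List[int]:
    """Toy phoneme encoding: maps each phoneme symbol to an integer ID."""
    return [ord(p) - ord("a") for p in phonemes]


@dataclass
class TTSOutput:
    emotion_index: int   # which emotion the output expresses
    intensity: float     # emotional intensity level of the output
    num_frames: int      # placeholder for the acoustic output length


def toy_tts(embedding: Vector, phoneme_encoding: List[int]) -> TTSOutput:
    """Toy stand-in for the TTS model: intensity is the embedding's Euclidean norm."""
    intensity = math.sqrt(sum(x * x for x in embedding))
    emotion_index = max(range(len(embedding)), key=lambda i: embedding[i])
    return TTSOutput(emotion_index, intensity, len(phoneme_encoding))


def rescale_embedding(first: Vector, value: float) -> Vector:
    """ASSUMPTION: derive the second embedding by scaling the first by `value`."""
    return [value * x for x in first]


def run_pipeline(text: str, value: float) -> Tuple[TTSOutput, TTSOutput]:
    """Runs the acts of claim 1 in order, except the final speaker playback."""
    first = emotion_embedding(text)                   # first textual embedding
    encoding = encode_phonemes(to_phonemes(text))     # phoneme sequence, then encoding
    first_out = toy_tts(first, encoding)              # first output
    second = rescale_embedding(first, value)          # second embedding from the value
    second_out = toy_tts(second, encoding)            # second output
    return first_out, second_out
```

In this sketch, calling `run_pipeline("What a great happy day!", 0.5)` keeps the same emotion index in both outputs while halving the intensity reported by the toy model, which mirrors the claim's distinction between the emotion and the adjustable emotional intensity level.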