| CPC G10L 13/02 (2013.01) [G06N 3/088 (2013.01); G10L 13/00 (2013.01); G10L 13/10 (2013.01); G10L 25/30 (2013.01); G10L 25/63 (2013.01); G10L 2013/105 (2013.01)] | 18 Claims | 

| 
          1. A computing system, comprising: 
              a processor; and 
              memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising: 
                  obtaining computer-readable text comprising words; 
                  providing the computer-readable text as input to an emotional classifier model that has been trained based upon a plurality of words having emotional labels assigned thereto, wherein the emotional labels identify respective emotions assigned to the plurality of words; 
                  obtaining a first textual embedding of the computer-readable text as output of the emotional classifier model, wherein the first textual embedding represents semantics of the words; 
                  generating a phoneme sequence based upon the words of the computer-readable text; 
                  generating a phoneme encoding based upon the phoneme sequence; 
                  providing the first textual embedding and the phoneme encoding as input to a text-to-speech (TTS) model, wherein the TTS model is trained using text-waveform pairs without emotional labels, wherein the TTS model generates a first output for the words that is indicative of an emotion and a first emotional intensity level to be expressed when the words are audibly output; 
                  receiving a value that is indicative of a second emotional intensity level; 
                  replacing the first textual embedding with a second textual embedding based upon the value; 
                  providing the second textual embedding as input to the TTS model in place of the first textual embedding, wherein the TTS model generates a second output for the words that is indicative of the second emotional intensity level of the words when the words are audibly output; and 
                  causing speech that includes the words to be played over a speaker based upon the second output of the TTS model, wherein the speech expresses the emotion and the second emotional intensity level. 
                 |
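
The order of the acts recited in claim 1 can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in and not the claimed implementation: `TOY_LEXICON` replaces the trained emotional classifier model, letters stand in for phonemes, `toy_tts` replaces the TTS model, and `rescale_embedding` assumes that the second textual embedding is obtained by scaling the first one to the received intensity value, a detail the claim does not specify.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical emotion set and word lexicon (assumptions, not from the claim).
EMOTIONS = ("joy", "sadness", "anger")
TOY_LEXICON = {
    "great": ("joy", 0.8),
    "lost": ("sadness", 0.6),
    "terrible": ("anger", 0.9),
}


@dataclass
class TtsOutput:
    emotion: str      # emotion the output is indicative of
    intensity: float  # emotional intensity level
    num_frames: int   # toy stand-in for the length of the audio output


def emotional_embedding(text: str) -> List[float]:
    """Stand-in for the embedding output by the emotional classifier model."""
    scores = [0.0] * len(EMOTIONS)
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in TOY_LEXICON:
            emotion, weight = TOY_LEXICON[word]
            scores[EMOTIONS.index(emotion)] += weight
    return scores


def phoneme_sequence(text: str) -> List[str]:
    """Toy grapheme-to-phoneme step: each ASCII letter is treated as one phoneme."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]


def phoneme_encoding(phonemes: List[str]) -> List[int]:
    """Toy phoneme encoding: map each symbol to an integer id."""
    return [ord(p) - ord("a") + 1 for p in phonemes]


def rescale_embedding(embedding: List[float], value: float) -> List[float]:
    """Assumed replacement rule: scale the embedding so its peak equals `value`."""
    peak = max(embedding)
    if peak <= 0:
        return list(embedding)
    return [x * value / peak for x in embedding]


def toy_tts(embedding: List[float], encoding: List[int]) -> TtsOutput:
    """Stand-in TTS model: reads emotion and intensity off the embedding."""
    peak = max(embedding)
    emotion = EMOTIONS[embedding.index(peak)] if peak > 0 else "neutral"
    return TtsOutput(emotion=emotion, intensity=peak, num_frames=len(encoding))


if __name__ == "__main__":
    text = "What a great day"
    first_embedding = emotional_embedding(text)
    encoding = phoneme_encoding(phoneme_sequence(text))
    first_output = toy_tts(first_embedding, encoding)

    # A received value indicates the second emotional intensity level.
    second_embedding = rescale_embedding(first_embedding, 0.3)
    second_output = toy_tts(second_embedding, encoding)
    print(first_output, second_output)
```

In this sketch only the textual embedding changes between the two passes; the phoneme encoding is reused, so the second output keeps the same emotion at a different intensity.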