US 12,293,756 B2
Computing system for domain expressive text to speech
Arijit Mukherjee, Uttarpara (IN); Shubham Bansal, Yamunanagar (IN); Sandeepkumar Satpal, Hyderabad (IN); and Rupeshkumar Rasiklal Mehta, Hyderabad (IN)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Nov. 11, 2021, as Appl. No. 17/524,288.
Claims priority of provisional application 63/250,981, filed on Sep. 30, 2021.
Prior Publication US 2023/0099732 A1, Mar. 30, 2023
Int. Cl. G10L 13/02 (2013.01); G06N 3/088 (2023.01); G10L 13/00 (2006.01); G10L 13/10 (2013.01); G10L 25/30 (2013.01); G10L 25/63 (2013.01)
CPC G10L 13/02 (2013.01) [G06N 3/088 (2013.01); G10L 13/00 (2013.01); G10L 13/10 (2013.01); G10L 25/30 (2013.01); G10L 25/63 (2013.01); G10L 2013/105 (2013.01)]
18 Claims
 
1. A computing system, comprising:
a processor; and
memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising:
obtaining computer-readable text comprising words;
providing the computer-readable text as input to an emotional classifier model that has been trained based upon a plurality of words having emotional labels assigned thereto, wherein the emotional labels identify respective emotions assigned to the plurality of words;
obtaining a first textual embedding of the computer-readable text as output of the emotional classifier model, wherein the first textual embedding represents semantics of the words;
generating a phoneme sequence based upon the words of the computer-readable text;
generating a phoneme encoding based upon the phoneme sequence;
providing the first textual embedding and the phoneme encoding as input to a text to speech (TTS) model, wherein the TTS model is trained using text-waveform pairs without emotional labels, wherein the TTS model generates a first output for the words that is indicative of an emotion and a first emotional intensity level to be expressed when the words are audibly output;
receiving a value that is indicative of a second emotional intensity level;
replacing the first textual embedding with a second textual embedding based upon the value;
providing the second textual embedding as input to the TTS model in place of the first textual embedding, wherein the TTS model generates a second output for the words that is indicative of a second emotional intensity level of the words when the words are audibly output; and
causing speech that includes the words to be played over a speaker based upon the second output of the TTS model, wherein the speech expresses the emotion and the second emotional intensity level.
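
The acts recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not disclose model internals or how the second textual embedding is derived from the received value. Every name below (LEXICON, classify_embedding, to_phonemes, encode_phonemes, tts_model, rescale_embedding, synthesize) is hypothetical. The word lexicon stands in for a trained emotional classifier model, letters stand in for phonemes, the TTS model is a stub that reports an emotion and intensity instead of producing audio, and the replacement embedding is assumed to be a rescaled copy of the first one.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a trained emotional classifier: word -> emotion vector.
EMOTION_AXES = ("joy", "sadness", "anger")
LEXICON = {
    "great": (1.0, 0.0, 0.0),
    "happy": (1.0, 0.0, 0.0),
    "sad": (0.0, 1.0, 0.0),
    "angry": (0.0, 0.0, 1.0),
}


def classify_embedding(text):
    """First textual embedding: mean of the per-word emotion vectors (toy)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    vecs = [LEXICON.get(w, (0.0, 0.0, 0.0)) for w in words]
    n = max(len(vecs), 1)
    return [sum(v[i] for v in vecs) / n for i in range(len(EMOTION_AXES))]


def to_phonemes(text):
    """Toy phoneme sequence: each ASCII letter is treated as one phoneme."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]


def encode_phonemes(phonemes):
    """Toy phoneme encoding: integer id per phoneme."""
    return [ord(p) - ord("a") for p in phonemes]


@dataclass
class TTSOutput:
    emotion: str       # emotion to be expressed
    intensity: float   # emotional intensity level
    n_frames: int      # placeholder for the acoustic output length


def tts_model(embedding, phoneme_encoding):
    """Stub TTS model: reads emotion and intensity off the embedding."""
    idx = max(range(len(embedding)), key=lambda i: embedding[i])
    return TTSOutput(EMOTION_AXES[idx], embedding[idx], len(phoneme_encoding))


def rescale_embedding(embedding, value):
    """Assumed rule: keep the direction, set the peak component to `value`."""
    peak = max(embedding)
    if peak <= 0:
        return list(embedding)
    scale = value / peak
    return [e * scale for e in embedding]


def synthesize(text, intensity_value):
    """Run the claimed sequence of acts and return both TTS outputs."""
    first_embedding = classify_embedding(text)
    encoding = encode_phonemes(to_phonemes(text))
    first_output = tts_model(first_embedding, encoding)
    # The second embedding replaces the first; the phoneme encoding is reused.
    second_embedding = rescale_embedding(first_embedding, intensity_value)
    second_output = tts_model(second_embedding, encoding)
    return first_output, second_output


if __name__ == "__main__":
    first, second = synthesize("What a great happy day", 0.9)
    print(first)
    print(second)
```

In this sketch only the textual embedding changes between the two passes, so the emotion stays the same while the intensity follows the received value. A real system would pass the second output to a vocoder and a speaker, which is omitted here.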