US 11,727,914 B2
Intent recognition and emotional text-to-speech learning
Pei Zhao, Redmond, WA (US); Kaisheng Yao, Redmond, WA (US); Max Leung, Redmond, WA (US); Bo Yan, Redmond, WA (US); Jian Luan, Redmond, WA (US); Yu Shi, Redmond, WA (US); Malone Ma, Redmond, WA (US); and Mei-Yuh Hwang, Redmond, WA (US)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Dec. 24, 2021, as Appl. No. 17/561,895.
Application 17/561,895 is a continuation of application No. 16/309,399, granted, now 11,238,842, previously published as PCT/US2017/036241, filed on Jun. 7, 2017.
Claims priority of application No. 201610410602.3 (CN), filed on Jun. 13, 2016.
Prior Publication US 2022/0122580 A1, Apr. 21, 2022
Int. Cl. G10L 13/027 (2013.01); G06F 3/16 (2006.01); G10L 13/08 (2013.01); G10L 15/06 (2013.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 25/63 (2013.01)
CPC G10L 13/027 (2013.01) [G06F 3/167 (2013.01); G10L 13/08 (2013.01); G10L 15/063 (2013.01); G10L 15/1807 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 25/63 (2013.01); G10L 2015/223 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A system for adapting an emotional text-to-speech model, the system comprising:
at least one processor; and
memory, operatively connected to the at least one processor and storing instructions that, when executed by the at least one processor, cause the at least one processor to:
receive training examples comprising speech input;
receive labelling data comprising emotion information associated with the speech input;
extract audio signal vectors from the training examples;
adapt a voice font model based on the audio signal vectors and the labelling data to generate an emotion-adapted voice font model;
generate first prosody annotations from the speech input;
generate second prosody annotations from the labelling data;
determine differences between the first prosody annotations and the second prosody annotations;
generate a prosody model based on the determined differences between the first prosody annotations and the second prosody annotations;
generate a prosody-adjusted pronunciation sequence for text input using the prosody model; and
render the text input to synthesized speech using the emotion-adapted voice font model and the prosody-adjusted pronunciation sequence.
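
Claim 1 recites a processing pipeline: extract audio signal vectors, adapt a voice font model with emotion labelling data, derive prosody annotations from both the speech and the labelling data, build a prosody model from their differences, and render text with the adapted font and a prosody-adjusted pronunciation sequence. The following is a minimal Python sketch of one possible reading of that pipeline; every function name, the toy adaptation rule, the per-emotion prosody lookup table, and the placeholder renderer are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

# All functions below are hypothetical stand-ins for the claim's steps;
# the patent does not prescribe these particular algorithms.

def extract_audio_signal_vectors(speech_examples):
    """Extract fixed-size feature vectors from raw speech samples."""
    return [np.asarray(x, dtype=float).reshape(-1)[:16] for x in speech_examples]

def adapt_voice_font_model(base_font, audio_vectors, emotion_labels, lr=0.1):
    """Toy adaptation: shift base voice-font parameters toward the mean of
    the emotion-labelled audio vectors to get an emotion-adapted font."""
    target = np.mean(np.stack(audio_vectors), axis=0)
    return {"params": base_font["params"] + lr * (target - base_font["params"]),
            "emotion": emotion_labels[0]}

def prosody_annotations_from_speech(speech_examples):
    """First prosody annotations: per-utterance level/variation estimates."""
    return np.array([[np.mean(x), np.std(x)] for x in speech_examples])

def prosody_annotations_from_labels(emotion_labels):
    """Second prosody annotations: nominal targets looked up per emotion."""
    table = {"happy": [0.6, 0.3], "sad": [0.2, 0.1], "neutral": [0.4, 0.2]}
    return np.array([table.get(e, table["neutral"]) for e in emotion_labels])

def build_prosody_model(diffs):
    """Prosody model here is simply the mean offset between the two
    annotation sets (the determined differences)."""
    return {"offset": diffs.mean(axis=0)}

def prosody_adjusted_sequence(text, prosody_model):
    """Attach the learned prosody offset to each token of the text input."""
    return [(tok, tuple(prosody_model["offset"])) for tok in text.split()]

def render(font_model, pron_sequence):
    """Placeholder renderer: emits a synthetic 'waveform' per token."""
    rng = np.random.default_rng(0)
    return np.concatenate([rng.normal(p[0], abs(p[1]) + 1e-3, 80)
                           for _, p in pron_sequence])

# Pipeline in the order of the claim steps, on toy data.
speech = [np.sin(np.linspace(0, 6, 200)) * a for a in (0.5, 0.8, 1.1)]
labels = ["happy", "happy", "happy"]

vecs = extract_audio_signal_vectors(speech)
font = adapt_voice_font_model({"params": np.zeros(16)}, vecs, labels)
p1 = prosody_annotations_from_speech(speech)
p2 = prosody_annotations_from_labels(labels)
prosody = build_prosody_model(p1 - p2)
seq = prosody_adjusted_sequence("hello world", prosody)
audio = render(font, seq)
print(audio.shape, font["emotion"])
```

In this sketch the "voice font model" and "prosody model" are reduced to small parameter vectors so the data flow of the claim is visible end to end; a practical system would replace them with trained acoustic and prosody models.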