US 12,094,447 B2
Neural text-to-speech synthesis with multi-level text information
Huaiping Ming, Redmond, WA (US); and Lei He, Redmond, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/293,404
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Dec. 13, 2018, PCT No. PCT/CN2018/120984
§ 371(c)(1), (2) Date May 12, 2021,
PCT Pub. No. WO2020/118643, PCT Pub. Date Jun. 18, 2020.
Prior Publication US 2022/0020355 A1, Jan. 20, 2022
Int. Cl. G10L 13/08 (2013.01); G06F 40/20 (2020.01); G06F 40/205 (2020.01); G06F 40/253 (2020.01); G06N 3/045 (2023.01); G06N 20/20 (2019.01); G10L 13/047 (2013.01); G10L 13/06 (2013.01); G10L 25/30 (2013.01)
CPC G10L 13/08 (2013.01) [G06F 40/20 (2020.01); G06F 40/205 (2020.01); G06F 40/253 (2020.01); G06N 3/045 (2023.01); G06N 20/20 (2019.01); G10L 13/047 (2013.01); G10L 13/06 (2013.01); G10L 25/30 (2013.01)]
11 Claims
 
1. A method for generating speech through neural text-to-speech (TTS) synthesis, comprising:
obtaining a text input having a sequence;
generating phoneme or character level text information based on the text input;
generating context-sensitive text information among words based on the text input, the context-sensitive text information having word level text information where generating the context-sensitive text information comprises:
identifying a word sequence from the text input;
up-sampling the word sequence to align with the text input;
generating a word embedding vector sequence of the word sequence and a phoneme vector sequence;
generating sentence level text information having a grammatical parsing information sequence, wherein generating the sentence level text information comprises:
performing grammatical parsing on the text input to obtain a grammatical structure of the text input; and
generating the grammatical parsing information sequence based on the grammatical structure of the text input by:
extracting grammatical parsing information of each word in the text input from the grammatical structure;
up-sampling the grammatical parsing information of each word to align with corresponding phonemes or characters in a phoneme or character sequence of the text input thereby generating a grammatical parsing information sequence; and
combining a phoneme vector sequence, the word embedding vector sequence and the grammatical parsing information sequence;
generating a text feature via a multi-input encoder coupled to receive the phoneme or character level text information and the word embedding vector sequence as inputs;
generating acoustic features from the text feature via a decoder; and
generating a speech waveform corresponding to the text input based at least on the text feature.
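
The up-sampling and combining steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the function names `upsample` and `combine`, the toy vectors, the per-word phoneme counts, and the choice of per-position concatenation as the way of "combining" are all assumptions made for illustration. The claim itself does not say how the sequences are combined or what the vectors contain.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def upsample(word_items: Sequence[T], phones_per_word: Sequence[int]) -> List[T]:
    """Repeat each word-level item once per phoneme of its word.

    Hypothetical helper: aligns a word-level sequence with a
    phoneme-level sequence by simple repetition.
    """
    if len(word_items) != len(phones_per_word):
        raise ValueError("need one phoneme count per word")
    out: List[T] = []
    for item, count in zip(word_items, phones_per_word):
        out.extend([item] * count)
    return out


def combine(
    phoneme_vecs: Sequence[Sequence[float]],
    word_vecs: Sequence[Sequence[float]],
    parse_vecs: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate the three aligned sequences position by position.

    Concatenation is an assumption; the claim only says "combining".
    """
    if not (len(phoneme_vecs) == len(word_vecs) == len(parse_vecs)):
        raise ValueError("sequences must be aligned to the same length")
    return [
        list(p) + list(w) + list(g)
        for p, w, g in zip(phoneme_vecs, word_vecs, parse_vecs)
    ]


# Toy input (all values invented): two words with 2 and 3 phonemes.
phones_per_word = [2, 3]
phoneme_vecs = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # one vector per phoneme
word_embeddings = [[1.0, 1.5], [2.0, 2.5]]          # one vector per word
parse_info = [[0.0], [1.0]]                          # one parse feature per word

# Up-sample the word-level sequences to phoneme length, then combine.
word_seq = upsample(word_embeddings, phones_per_word)
parse_seq = upsample(parse_info, phones_per_word)
combined = combine(phoneme_vecs, word_seq, parse_seq)
```

In this sketch every phoneme position ends up carrying its own phoneme vector, the embedding of the word it belongs to, and that word's parsing feature, so a downstream encoder would see all three levels of text information at each step.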