CPC G10L 13/08 (2013.01) [G06F 40/20 (2020.01); G06F 40/205 (2020.01); G06F 40/253 (2020.01); G06N 3/045 (2023.01); G06N 20/20 (2019.01); G10L 13/047 (2013.01); G10L 13/06 (2013.01); G10L 25/30 (2013.01)]
11 Claims
1. A method for generating speech through neural text-to-speech (TTS) synthesis, comprising:
obtaining a text input having a sequence;
generating phoneme or character level text information based on the text input;
generating context-sensitive text information among words based on the text input, the context-sensitive text information having word level text information, wherein generating the context-sensitive text information comprises:
identifying a word sequence from the text input;
up-sampling the word sequence to align with the text input;
generating a word embedding vector sequence of the word sequence and a phoneme vector sequence;
generating sentence level text information having a grammatical parsing information sequence, wherein generating the sentence level text information comprises:
performing grammatical parsing on the text input to obtain a grammatical structure of the text input; and
generating the grammatical parsing information sequence based on the grammatical structure of the text input by:
extracting grammatical parsing information of each word in the text input from the grammatical structure;
up-sampling the grammatical parsing information of each word to align with corresponding phonemes or characters in a phoneme or character sequence of the text input, thereby generating a grammatical parsing information sequence; and
combining a phoneme vector sequence, the word embedding vector sequence and the grammatical parsing information sequence;
generating a text feature via a multi-input encoder coupled to receive the phoneme or character level text information and the word embedding vector sequence as inputs;
generating acoustic features from the text feature via a decoder; and
generating a speech waveform corresponding to the text input based at least on the text feature.
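The up-sampling and combining steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how up-sampling or combining is performed, so the sketch assumes that up-sampling repeats each word-level vector once per phoneme of that word and that combining concatenates the aligned vectors at each phoneme position. The function names `upsample` and `combine`, the toy vectors, and the per-word phoneme counts are all hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def upsample(word_features: Sequence[Sequence[float]],
             phonemes_per_word: Sequence[int]) -> List[List[float]]:
    """Repeat each word-level vector once per phoneme of that word.

    Assumption: repetition is only one possible up-sampling scheme.
    """
    if len(word_features) != len(phonemes_per_word):
        raise ValueError("one phoneme count is required per word")
    out: List[List[float]] = []
    for vec, count in zip(word_features, phonemes_per_word):
        out.extend(list(vec) for _ in range(count))
    return out


def combine(*sequences: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate aligned vectors position by position.

    Assumption: concatenation stands in for the unspecified "combining".
    """
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return [[x for vec in step for x in vec] for step in zip(*sequences)]


# Toy input: two words with 2 and 3 phonemes respectively (hypothetical).
phonemes_per_word = [2, 3]
phoneme_vectors = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # one per phoneme
word_embeddings = [[1.0, 1.5], [2.0, 2.5]]             # one per word
parse_info = [[0.0], [1.0]]                            # one per word

word_seq = upsample(word_embeddings, phonemes_per_word)
parse_seq = upsample(parse_info, phonemes_per_word)
combined = combine(phoneme_vectors, word_seq, parse_seq)
```

Under these assumptions, `combined` holds one vector per phoneme, each carrying the phoneme vector, the embedding of the enclosing word, and that word's parsing information.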