US 12,361,926 B2
End-to-end neural text-to-speech model with prosody control
Ioan Calapodescu, Grenoble (FR); Inyoung Kim, Meylan (FR); Laurent Besacier, Seyssinet Pariset (FR); and Siddique Latif, Queensland (AU)
Assigned to NAVER CORPORATION, Seongnam-si (KR)
Filed by NAVER CORPORATION, Seongnam-si (KR)
Filed on Sep. 23, 2022, as Appl. No. 17/934,836.
Claims priority of provisional application 63/266,215, filed on Dec. 30, 2021.
Prior Publication US 2023/0215421 A1, Jul. 6, 2023
Int. Cl. G10L 13/047 (2013.01); G06F 40/169 (2020.01); G06F 40/237 (2020.01); G06F 40/284 (2020.01); G10L 13/10 (2013.01)
CPC G10L 13/10 (2013.01) [G06F 40/169 (2020.01); G06F 40/237 (2020.01); G06F 40/284 (2020.01); G10L 13/047 (2013.01); G10L 2013/105 (2013.01)]
36 Claims
 
1. A computer implemented method for generating a neural text-to-speech (TTS) model, the method comprising:
inputting an annotated set of text documents into the TTS model stored in a memory, the annotated set of text documents including annotations inserted therein to indicate prosodic features, wherein, for each of a plurality of groups of text documents in the annotated set, each text document in the group is annotated to indicate prosodic features based on syntactic, semantic, and/or pragmatic features of the text document, wherein the prosodic features are for a focus type selected among a set of focus types, and wherein the annotations comprise control tags and/or control tokens;
training, using a processor, the TTS model using the annotated set of text documents and a corresponding dataset of speech representations of the text documents that include prosody associated with the indicated prosodic features, wherein the prosody comprises pitch, duration, rhythm, pause, and/or intensity of one or more documents, utterances, words, chains of words, sub-words, phonemes, and/or syllables;
wherein the trained TTS model learns to associate the prosody with the annotations;
wherein the trained TTS model processes an input text to generate speech signals, and a speech synthesizer generates speech from the produced speech signals.
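
As an illustration of the annotation step recited in claim 1, the following minimal sketch shows one way control tags for a focus type could be inserted into a text document and paired with a corresponding speech representation. It is not taken from the patent: the tag syntax, the focus-type names ("broad", "narrow", "contrastive"), and the names FocusSpan, annotate, and pair_with_speech are all hypothetical, and training itself is omitted.

```python
from dataclasses import dataclass

# Hypothetical focus types; the claim only recites "a set of focus types".
FOCUS_TYPES = ("broad", "narrow", "contrastive")


@dataclass(frozen=True)
class FocusSpan:
    start: int        # index of the first focused word
    end: int          # index one past the last focused word
    focus_type: str   # must be one of FOCUS_TYPES


def annotate(text: str, spans: list[FocusSpan]) -> str:
    """Wrap focused word ranges in control tags, e.g. <narrow> ... </narrow>.

    Spans are assumed not to overlap; the tag format is an assumption.
    """
    words = text.split()
    opens: dict[int, str] = {}
    closes: dict[int, str] = {}
    for span in spans:
        if span.focus_type not in FOCUS_TYPES:
            raise ValueError(f"unknown focus type: {span.focus_type}")
        if not 0 <= span.start < span.end <= len(words):
            raise ValueError(f"span out of range: {span}")
        opens[span.start] = f"<{span.focus_type}>"
        closes[span.end - 1] = f"</{span.focus_type}>"

    out: list[str] = []
    for i, word in enumerate(words):
        if i in opens:
            out.append(opens[i])
        out.append(word)
        if i in closes:
            out.append(closes[i])
    return " ".join(out)


def pair_with_speech(annotated: list[str], speech_refs: list[str]) -> list[tuple[str, str]]:
    """Pair each annotated document with its speech representation.

    speech_refs stands in for whatever representation is used
    (file paths here, purely as a placeholder).
    """
    if len(annotated) != len(speech_refs):
        raise ValueError("each annotated document needs one speech representation")
    return list(zip(annotated, speech_refs))


if __name__ == "__main__":
    doc = annotate("Mary bought the red car", [FocusSpan(3, 4, "contrastive")])
    print(doc)
    print(pair_with_speech([doc], ["utt_0001.wav"]))
```

In this sketch the focus annotation is decided upstream (for example from syntactic, semantic, or pragmatic analysis) and passed in as word-index spans; a model trained on such pairs would then have the opportunity to associate the tags with the prosody present in the paired speech.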