CPC G10L 13/047 (2013.01) [G10L 15/02 (2013.01); G10L 15/16 (2013.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 25/18 (2013.01); G10L 2015/025 (2013.01)]; 20 Claims

1. A method comprising:
    receiving first audio data representing first human speech;
    processing the first audio data using a neural network encoder to generate first encoded data, the first encoded data representing semantic information and feature data of the first human speech, wherein the feature data represents one or more of prosody, timbre, speaker identity, or speaking style of the first human speech;
    training an acoustic/semantic language model (ASLM) to process a first portion of the first encoded data to predict a second portion of the first encoded data, the second portion occurring after the first portion;
    receiving second audio data representing second human speech having voice characteristics to be reproduced by a text-to-speech (TTS) model;
    processing the second audio data using the neural network encoder to generate second encoded data, the second encoded data representing semantic information and feature data of the second human speech, wherein the feature data represents one or more of prosody, timbre, speaker identity, or speaking style of the second human speech;
    processing the second encoded data using the ASLM to generate third encoded data, the third encoded data representing a predicted continuation of the second human speech;
    processing the third encoded data using a neural network decoder to generate third audio data; and
    training the TTS model using the second audio data and the third audio data.
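
The first three steps of the claim describe an encoder followed by continuation-prediction training. Below is a minimal PyTorch sketch of that stage; `SpeechEncoder`, `ASLM`, the convolutional front end, the MSE objective, and all shapes and hyperparameters are illustrative assumptions, not the patent's disclosed implementation.

```python
# Minimal sketch of the claimed encoder + ASLM training stage.
# All module names, shapes, and hyperparameters are illustrative
# assumptions, not the patent's actual implementation.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps raw audio to frame-level encodings intended to carry both
    semantic content and feature data (prosody, timbre, identity, style)."""
    def __init__(self, d_model=512):
        super().__init__()
        # Hypothetical stand-in: a strided conv front end over waveform samples.
        self.conv = nn.Conv1d(1, d_model, kernel_size=400, stride=320)

    def forward(self, audio):                  # audio: (batch, samples)
        x = self.conv(audio.unsqueeze(1))      # (batch, d_model, frames)
        return x.transpose(1, 2)               # (batch, frames, d_model)

class ASLM(nn.Module):
    """Causal model trained to predict later encoded frames from earlier ones."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, d_model)

    def forward(self, enc):                    # enc: (batch, frames, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(enc.size(1))
        return self.head(self.backbone(enc, mask=mask))

encoder, aslm = SpeechEncoder(), ASLM()
opt = torch.optim.Adam(aslm.parameters(), lr=1e-4)

def train_step(first_audio):
    """One ASLM update: the first portion of the encoding predicts the second."""
    with torch.no_grad():                      # encoder held frozen in this sketch
        enc = encoder(first_audio)             # first encoded data
    pred = aslm(enc[:, :-1])                   # process the first portion
    loss = nn.functional.mse_loss(pred, enc[:, 1:])  # predict the later portion
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Shifting the frames by one position is the simplest way to realize "a first portion predicts a second portion occurring after it"; the patent's actual portioning scheme may differ.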
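The remaining steps (continuation, decoding, and assembling TTS training data) can be sketched in the same hypothetical setting, reusing `encoder` and `aslm` from the sketch above; `SpeechDecoder`, the frame budget, and the pairing helper are likewise assumptions rather than the claimed architecture.

```python
# Continues the previous sketch (reuses torch, nn, encoder, aslm).
class SpeechDecoder(nn.Module):
    """Inverts the encoder: frame-level encodings back to waveform samples."""
    def __init__(self, d_model=512):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(d_model, 1, kernel_size=400, stride=320)

    def forward(self, enc):                    # enc: (batch, frames, d_model)
        return self.deconv(enc.transpose(1, 2)).squeeze(1)

decoder = SpeechDecoder()

@torch.no_grad()
def continue_speech(second_audio, extra_frames=50):
    """Autoregressively extend the second encoding with the ASLM, then
    decode only the predicted continuation into third audio data."""
    enc = encoder(second_audio)                # second encoded data
    for _ in range(extra_frames):
        nxt = aslm(enc)[:, -1:]                # predict the next encoded frame
        enc = torch.cat([enc, nxt], dim=1)     # third encoded data accumulates
    return decoder(enc[:, -extra_frames:])     # third audio data

def tts_training_pair(second_audio):
    """The second and third audio data together form the voice-consistent
    material on which the TTS model is then trained (training loop elided)."""
    return second_audio, continue_speech(second_audio)
```

Because the continuation is generated from the second speaker's own encoding, the decoded third audio plausibly preserves that speaker's voice characteristics, which is what makes the pair usable as TTS training material under the claim's final step.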