CPC G10L 25/30 (2013.01) [G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 5/046 (2013.01); G06N 7/01 (2023.01); G10L 13/047 (2013.01); G10L 13/08 (2013.01); G10L 25/18 (2013.01)] | 20 Claims |
9. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
generating, from an input data representing a text input, an output sequence of audio data representing the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, including, for each of the plurality of time steps:
generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork;
generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and
selecting an audio output sample for the time step from the plurality of possible audio output samples in accordance with the probability distribution.
|
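The per-time-step operations recited in claim 9 can be illustrated with a minimal sketch. It is not the claimed implementation: `toy_decoder` and `toy_vocoder` are hypothetical stand-ins for the decoder and vocoder neural networks, and the bin count, the number of quantized sample values, and the mapping of one character to one time step are assumptions made only for illustration. The sketch shows the control flow only: an autoregressive spectrogram frame, then a probability distribution over possible audio samples, then selection of a sample in accordance with that distribution.

```python
import math
import random

NUM_MEL_BINS = 4        # assumed size of one mel-frequency spectrogram frame
NUM_SAMPLE_VALUES = 8   # assumed number of possible (quantized) audio sample values


def toy_decoder(text_portion, previous_frame):
    """Hypothetical stand-in for the autoregressive decoder network.

    Produces a mel frame from one portion of the input and the previous frame.
    A real decoder would use an LSTM, a linear transform, and convolutions.
    """
    bias = (ord(text_portion) % 7) * 0.1
    return [math.tanh(0.5 * prev + bias + 0.05 * k)
            for k, prev in enumerate(previous_frame)]


def toy_vocoder(mel_frame):
    """Hypothetical stand-in for the vocoder network.

    Maps a mel frame to a probability distribution over possible sample values
    using a numerically stable softmax.
    """
    energy = sum(mel_frame)
    logits = [energy * (v - NUM_SAMPLE_VALUES / 2) / NUM_SAMPLE_VALUES
              for v in range(NUM_SAMPLE_VALUES)]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize(text, seed=0):
    """Generate one audio sample per time step (here, one step per character)."""
    rng = random.Random(seed)
    frame = [0.0] * NUM_MEL_BINS
    samples = []
    for portion in text:
        frame = toy_decoder(portion, frame)   # mel spectrogram for this step
        probs = toy_vocoder(frame)            # distribution over sample values
        # Select a sample in accordance with the probability distribution.
        sample = rng.choices(range(NUM_SAMPLE_VALUES), weights=probs, k=1)[0]
        samples.append(sample)
    return samples
```

Calling `synthesize("hello")` returns a list of five integer sample indices, one per time step. Sampling from the distribution is only one way of selecting "in accordance with" it; the claim language would also cover other selection rules.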