| CPC G10L 13/08 (2013.01) [G10L 13/04 (2013.01)] | 20 Claims |

|
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a non-autoregressive text-to-speech (TTS) model, the operations comprising:
obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding;
using a duration model network:
predicting, based on the sequence representation, a phoneme duration for each phoneme represented by the encoded text sequence;
based on the predicted phoneme durations, generating, for each phoneme represented by the encoded text sequence, respective start and end boundaries;
mapping, based on a number of phonemes represented by the encoded text sequence and a number of reference frames in a reference mel-frequency spectrogram sequence, the respective start and end boundaries generated for each phoneme into respective grid matrices;
based on the respective grid matrices mapped from the start and end boundaries, learning, using a first function conditioned on the sequence representation, an interval representation matrix;
determining a product of the interval representation matrix and the sequence representation; and
upsampling, based on the product of the interval representation matrix and the sequence representation, the sequence representation into an upsampled output specifying a number of frames;
generating, as output from a spectrogram decoder comprising a stack of one or more self-attention blocks, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence;
determining a final spectrogram loss based on the one or more predicted mel-frequency spectrogram sequences and the reference mel-frequency spectrogram sequence; and
training the TTS model based on the final spectrogram loss.
|
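The duration-model steps recited in claim 1 (boundaries, grid matrices, interval representation matrix, product, upsampling) can be illustrated with a minimal sketch. This is not the claimed implementation: the function names, the use of frame centers, and the `sharpness` parameter are hypothetical, and the fixed softmax scoring in `interval_weights` is only a stand-in for the learned first function that the claim conditions on the sequence representation.

```python
import math
from itertools import accumulate


def boundaries(durations):
    """Turn per-phoneme durations (in frames) into start and end boundaries."""
    ends = list(accumulate(durations))
    starts = [e - d for e, d in zip(ends, durations)]
    return starts, ends


def grid_matrices(starts, ends, num_frames):
    """Map the boundaries onto two (frames x phonemes) grids.

    s_grid[t][k] is the signed distance from phoneme k's start to frame t.
    e_grid[t][k] is the signed distance from frame t to phoneme k's end.
    Frame centers (t + 0.5) are an assumption made for this sketch.
    """
    centers = [t + 0.5 for t in range(num_frames)]
    s_grid = [[c - s for s in starts] for c in centers]
    e_grid = [[e - c for e in ends] for c in centers]
    return s_grid, e_grid


def interval_weights(s_grid, e_grid, sharpness=1.0):
    """Hypothetical stand-in for the learned first function.

    Scores each (frame, phoneme) pair from the grids and normalizes over
    phonemes with a softmax, giving one row of weights per frame.
    """
    weights = []
    for s_row, e_row in zip(s_grid, e_grid):
        # The score is positive only when the frame lies inside the interval.
        logits = [sharpness * min(s, e) for s, e in zip(s_row, e_row)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([x / total for x in exps])
    return weights


def upsample(weights, sequence):
    """Multiply (frames x phonemes) weights by the (phonemes x dims) sequence."""
    dims = len(sequence[0])
    return [
        [sum(w * h[d] for w, h in zip(row, sequence)) for d in range(dims)]
        for row in weights
    ]


# Toy usage: two phonemes, five reference frames, two-dimensional representation.
durations = [2.0, 3.0]
sequence = [[1.0, 0.0], [0.0, 1.0]]
starts, ends = boundaries(durations)
s_grid, e_grid = grid_matrices(starts, ends, num_frames=5)
weights = interval_weights(s_grid, e_grid, sharpness=10.0)
upsampled = upsample(weights, sequence)
```

In the claimed training method the weights would come from a trained function rather than a fixed rule, and the upsampled output would be passed to the spectrogram decoder; the sketch only shows how the matrix shapes fit together.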