CPC G10L 13/08 (2013.01) [G06F 40/126 (2020.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 3/088 (2013.01); G10L 13/047 (2013.01); G10L 21/10 (2013.01); G06N 3/048 (2023.01)] | 22 Claims |
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations for training a non-autoregressive text-to-speech (TTS) model, the operations comprising:
receiving training data including a reference audio signal and a corresponding input text sequence, the reference audio signal comprising a spoken utterance and the input text sequence corresponding to a transcript of the reference audio signal;
encoding, using a residual encoder, the reference audio signal into a variational embedding, the variational embedding disentangling style/prosody information from the reference audio signal;
encoding, using a text encoder, the input text sequence into an encoded text sequence;
predicting, using a duration decoder comprising a stack of self-attention blocks followed by two independent projections, based on the encoded text sequence and the variational embedding, a phoneme duration for each phoneme in the input text sequence by:
predicting, using a sigmoid activation following a first one of the two independent projections, a probability of non-zero duration for each phoneme;
predicting, using a softplus activation following a second one of the two independent projections, the phoneme duration for each phoneme;
determining whether the probability of non-zero duration predicted for the corresponding phoneme is less than a threshold value; and
when the probability of non-zero duration is less than the threshold value, zeroing out the phoneme duration predicted for the corresponding phoneme;
determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration sampled from the reference audio signal for each phoneme in the input text sequence;
generating, as output from a non-autoregressive spectrogram decoder comprising a stack of self-attention blocks, based on an output of the duration decoder, multiple predicted mel-frequency spectrogram sequences for the input text sequence;
determining a final spectrogram loss based on the multiple predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence sampled from the reference audio signal; and
training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss determined for each phoneme in the input text sequence.
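The duration-decoder steps recited above (a sigmoid projection predicting the probability of non-zero duration, a softplus projection predicting the duration itself, and thresholded zeroing) can be sketched as follows. This is a minimal NumPy illustration, not the claimed implementation: the projection weights `w_prob` and `w_dur`, the threshold default of 0.5, and the mean-squared-error form of the phoneme duration loss are all assumptions, since the claim does not fix them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x))
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def predict_durations(hidden, w_prob, w_dur, threshold=0.5):
    """Duration head sketch.

    hidden: [num_phonemes, dim] output of the self-attention stack.
    w_prob, w_dur: the two independent projection weight vectors
    (hypothetical names; the claim only requires two projections).
    """
    p_nonzero = sigmoid(hidden @ w_prob)   # probability of non-zero duration
    durations = softplus(hidden @ w_dur)   # predicted per-phoneme duration
    # Zero out durations whose non-zero probability is below the threshold
    durations = np.where(p_nonzero < threshold, 0.0, durations)
    return p_nonzero, durations

def duration_loss(predicted, reference):
    # Mean squared error against reference durations sampled from the
    # reference audio signal; the loss form is an assumption here.
    return float(np.mean((predicted - reference) ** 2))
```

Softplus keeps predicted durations non-negative, while the separate sigmoid gate lets the model assign exactly zero frames to silent or elided phonemes, which a softplus output alone cannot produce.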
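The final spectrogram loss aggregates losses over the multiple predicted mel-frequency spectrogram sequences against the single reference sequence. A minimal sketch, assuming an L1 loss per predicted sequence averaged into one scalar; the claim does not specify the per-sequence loss or the aggregation, so both are assumptions:

```python
import numpy as np

def final_spectrogram_loss(predicted_specs, reference_spec):
    """Aggregate per-prediction spectrogram losses into a final loss.

    predicted_specs: list of [frames, mel_bins] arrays, one per decoder
    output (the claim leaves the number unspecified).
    reference_spec: [frames, mel_bins] array sampled from the reference
    audio signal, assumed here to be time-aligned with each prediction.
    """
    losses = [np.mean(np.abs(pred - reference_spec)) for pred in predicted_specs]
    return float(np.mean(losses))
```

Penalizing every predicted sequence, rather than only the last, gives each stage of the spectrogram decoder a direct training signal.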