US 12,249,315 B2
Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
Isaac Elias, Mountain View, CA (US); Byungha Chun, Tokyo (JP); Jonathan Shen, Mountain View, CA (US); Ye Jia, Mountain View, CA (US); Yu Zhang, Mountain View, CA (US); and Yonghui Wu, Fremont, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 31, 2023, as Appl. No. 18/499,031.
Application 18/499,031 is a continuation of application No. 17/326,542, filed on May 21, 2021, granted, now Pat. No. 11,823,656.
Claims priority of provisional application 63/164,503, filed on Mar. 22, 2021.
Prior Publication US 2024/0062743 A1, Feb. 22, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 13/08 (2013.01); G10L 13/04 (2013.01)

CPC G10L 13/08 (2013.01) [G10L 13/04 (2013.01)]

20 Claims

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a non-autoregressive text-to-speech (TTS) model, the operations comprising:

obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding;

using a duration model network:

predicting, based on the sequence representation, a phoneme duration for each phoneme represented by the encoded text sequence;

based on the predicted phoneme durations, generating, for each phoneme represented by the encoded text sequence, respective start and end boundaries;

mapping, based on a number of phonemes represented by the encoded text sequence and a number of reference frames in a reference mel-frequency spectrogram sequence, the respective start and end boundaries generated for each phoneme into respective grid matrices;

based on the respective grid matrices mapped from the start and end boundaries, learning, using a first function conditioned on the sequence representation, an interval representation matrix;

determining a product of the interval representation matrix and the sequence representation; and

upsampling, based on the product of the interval representation matrix and the sequence representation, the sequence representation into an upsampled output specifying a number of frames;

generating, as output from a spectrogram decoder comprising a stack of one or more self-attention blocks, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence;

determining a final spectrogram loss based on the one or more predicted mel-frequency spectrogram sequences and the reference mel-frequency spectrogram sequence; and

training the TTS model based on the final spectrogram loss.
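
Illustrative note (not part of the claims): the duration-model operations recited in claim 1 can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the patented implementation. The function names, the frame-centre sampling (t = f + 0.5), and the `sharpness` parameter are hypothetical. In particular, the claim recites a learned first function conditioned on the sequence representation; the sketch substitutes a fixed softmax over signed distances to the boundaries so that it runs without training. The spectrogram decoder, the final spectrogram loss, and the training step are omitted.

```python
import math


def boundaries(durations):
    """Return per-phoneme (starts, ends) as running sums of the durations."""
    starts, ends, total = [], [], 0.0
    for d in durations:
        starts.append(total)
        total += d
        ends.append(total)
    return starts, ends


def grid_matrices(starts, ends, num_frames):
    """Map boundaries onto two T x K grids of signed distances.

    Assumption: frame f is sampled at its centre, t = f + 0.5.
    """
    centres = [f + 0.5 for f in range(num_frames)]
    start_grid = [[t - s for s in starts] for t in centres]
    end_grid = [[e - t for e in ends] for t in centres]
    return start_grid, end_grid


def interval_matrix(start_grid, end_grid, sharpness=10.0):
    """Hypothetical stand-in for the learned function: a row-wise softmax.

    A phoneme scores highest for frames lying inside its [start, end] span.
    """
    rows = []
    for s_row, e_row in zip(start_grid, end_grid):
        logits = [sharpness * min(s, e) for s, e in zip(s_row, e_row)]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        norm = sum(exps)
        rows.append([x / norm for x in exps])
    return rows


def upsample(weights, sequence):
    """Product of W (T x K) and H (K x D): one D-dimensional vector per frame."""
    dim = len(sequence[0])
    return [
        [sum(w * h[j] for w, h in zip(w_row, sequence)) for j in range(dim)]
        for w_row in weights
    ]


# Toy inputs (hypothetical): 3 phonemes, 2-dimensional sequence representation,
# and a reference spectrogram of 6 frames.
durations = [2.0, 3.0, 1.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
num_frames = 6

starts, ends = boundaries(durations)
start_grid, end_grid = grid_matrices(starts, ends, num_frames)
weights = interval_matrix(start_grid, end_grid)
upsampled = upsample(weights, sequence)

# Index of the phoneme that dominates each output frame.
dominant = [row.index(max(row)) for row in weights]
print(dominant)
```

In this toy setting each row of `weights` sums to 1, and each output frame is drawn almost entirely from the phoneme whose interval contains that frame. This shows how a K-length sequence representation is expanded to a T-length upsampled output.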