US 11,869,483 B2
Unsupervised alignment for text to speech synthesis using neural networks
Kevin Shih, Santa Clara, CA (US); Jose Rafael Valle Gomes da Costa, Berkeley, CA (US); Rohan Badlani, San Jose, CA (US); Adrian Lancucki, Legnica (PL); Wei Ping, Sunnyvale, CA (US); and Bryan Catanzaro, Los Altos Hills, CA (US)
Assigned to Nvidia Corporation, Santa Clara, CA (US)
Filed by Nvidia Corporation, Santa Clara, CA (US)
Filed on Oct. 7, 2021, as Appl. No. 17/496,636.
Application 17/496,636 is a continuation of application No. 17/496,569, filed on Oct. 7, 2021.
Prior Publication US 2023/0110905 A1, Apr. 13, 2023
Int. Cl. G10L 13/00 (2006.01); G10L 13/08 (2013.01); G10L 13/10 (2013.01); G10L 13/047 (2013.01); G10L 25/90 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G10L 13/033 (2013.01)

CPC G10L 13/047 (2013.01) [G06N 3/045 (2023.01); G06N 3/08 (2013.01); G10L 13/0335 (2013.01); G10L 13/08 (2013.01); G10L 25/90 (2013.01)]

20 Claims

1. A computer-implemented method, comprising:

determining, from a plurality of audio samples including human speech, alignments between a phoneme and a phoneme duration;

generating, from the alignments, an alignment matrix corresponding to a distribution;

generating a set of synthesized training audio samples;

generating, from the set of synthesized training audio samples, synthesized distributions;

training one or more machine learning systems using, at least in part, the synthesized distributions and the distribution; and

removing, after training the one or more machine learning systems, the synthesized distributions to form an inferencing distribution.
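
For illustration only, the first two steps of claim 1 (determining alignments and generating an alignment matrix corresponding to a distribution) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `soft_alignment` and `durations_from_alignment`, the one-dimensional toy features, the softmax over negative squared distances, and the per-frame argmax used to read off durations are all hypothetical choices made here and are not taken from the claim text. The argmax step in particular is a simplification that does not enforce a monotonic alignment.

```python
import math


def soft_alignment(frame_feats, phoneme_feats, temperature=1.0):
    """Return a matrix A where A[t][n] is the probability that audio frame t
    aligns with phoneme n. Each row is a softmax over negative squared
    distances, so each row is a distribution over phonemes (an assumption)."""
    matrix = []
    for frame in frame_feats:
        scores = [
            -sum((a - b) ** 2 for a, b in zip(frame, phoneme)) / temperature
            for phoneme in phoneme_feats
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def durations_from_alignment(matrix):
    """Assign each frame to its most likely phoneme and count the frames per
    phoneme, giving one duration (in frames) per phoneme."""
    counts = [0] * len(matrix[0])
    for row in matrix:
        counts[row.index(max(row))] += 1
    return counts


# Toy data (hypothetical): five audio frames and two phonemes, 1-D features.
frames = [[0.0], [0.1], [0.9], [1.0], [1.1]]
phonemes = [[0.0], [1.0]]

alignment = soft_alignment(frames, phonemes)
durations = durations_from_alignment(alignment)
print(durations)
```

In this toy setup the first two frames lie closest to the first phoneme and the remaining three to the second, so the durations come out as two and three frames. The later steps of the claim (synthesized training samples, training, and removal of the synthesized distributions) are not covered by this sketch.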