US 12,488,778 B2
Normalizing flows with neural splines for high-quality speech synthesis
Kevin Shih, Cambridge, MA (US); José Rafael Valle Gomes da Costa, Berkeley, CA (US); Rohan Badlani, Santa Clara, CA (US); João Felipe Santos, Vancouver (CA); and Bryan Catanzaro, Los Altos Hills, CA (US)
Assigned to NVIDIA Corporation, Santa Clara, CA (US)
Filed by NVIDIA Corporation, Santa Clara, CA (US)
Filed on Jan. 20, 2023, as Appl. No. 18/099,840.
Claims priority of provisional application 63/392,406, filed on Jul. 26, 2022.
Prior Publication US 2024/0038212 A1, Feb. 1, 2024
Int. Cl. G10L 13/027 (2013.01); G10L 13/08 (2013.01); G10L 25/30 (2013.01)

CPC G10L 13/027 (2013.01) [G10L 13/08 (2013.01); G10L 25/30 (2013.01)]

19 Claims

1. A method to obtain a speech model, the method comprising:

filling, with synthetic values, one or more gaps in a time series of a speech characteristics (SC);

identifying, using one or more iterations, a mapping of the time series of the SC on a target distribution of a latent variable,

wherein each of the one or more iterations comprises a non-linear invertible transformation of at least a subset of the time series of the SC, and

wherein parameters of the non-linear invertible transformations are determined using a neural network that approximates a statistics of the time series of the SC with a statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable; and

generating, using the identified mapping, a speech signal corresponding to an input text.
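
For illustration only, the gap-filling and invertible-transformation steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The speech characteristic is assumed to be a pitch-like series with missing frames marked as NaN, the synthetic values are assumed to come from linear interpolation, and the non-linear invertible transformation is assumed to be a monotonic piecewise-linear spline on the interval [-3, 3]. The function names (fill_gaps, make_knots, forward, inverse) are hypothetical, and the fixed parameter arrays stand in for the output of the neural network recited in the claim.

```python
import numpy as np

def fill_gaps(sc):
    """Fill NaN gaps in a 1-D series by linear interpolation (assumed fill rule)."""
    sc = np.asarray(sc, dtype=float)
    idx = np.arange(sc.size)
    known = ~np.isnan(sc)
    return np.interp(idx, idx[known], sc[known])

def make_knots(raw_widths, raw_heights, lo=-3.0, hi=3.0):
    """Map unconstrained parameters to strictly increasing knots on [lo, hi].

    In a trained model these raw parameters would be predicted by a neural
    network; here they are fixed arrays (assumption for illustration).
    """
    w = np.exp(raw_widths)
    w = w / w.sum()                      # positive bin widths summing to 1
    h = np.exp(raw_heights)
    h = h / h.sum()                      # positive bin heights summing to 1
    xk = lo + (hi - lo) * np.concatenate([[0.0], np.cumsum(w)])
    yk = lo + (hi - lo) * np.concatenate([[0.0], np.cumsum(h)])
    return xk, yk

def forward(x, xk, yk):
    """Monotonic piecewise-linear map, data -> latent (inputs within [lo, hi])."""
    return np.interp(x, xk, yk)

def inverse(z, xk, yk):
    """Exact inverse of forward(): swap the roles of the two knot vectors."""
    return np.interp(z, yk, xk)

# Usage: a hypothetical pitch contour in Hz with three missing frames.
raw = [120.0, np.nan, np.nan, 150.0, 160.0, np.nan, 180.0]
filled = fill_gaps(raw)
x = (filled - filled.mean()) / filled.std()   # normalize into the spline's range
xk, yk = make_knots(np.array([0.1, -0.2, 0.3]), np.array([-0.5, 0.4, 0.0]))
z = forward(x, xk, yk)                        # one iteration of the mapping
x_back = inverse(z, xk, yk)                   # recovers x up to rounding
```

Because every bin width and height is positive, the map is strictly increasing and therefore invertible. In a normalizing-flow setting, the inverse direction would presumably be the one used at generation time, carrying a sample of the latent variable back to a speech characteristic.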