CPC G10L 13/047 (2013.01) [G06N 3/04 (2013.01); G06N 3/08 (2013.01)]    35 Claims

1. A computer-implemented method of training a feedforward generative neural network having a plurality of generative parameters and configured to generate output audio examples using conditioning text inputs,
  wherein each conditioning text input comprises a respective linguistic feature representation at each of a plurality of input time steps,
  wherein the feedforward generative neural network is configured to receive a generative input comprising a conditioning text input and to process the generative input to generate an audio output that comprises respective audio samples at each of a plurality of output time steps, and
  wherein the method comprises:
    obtaining a training conditioning text input;
    processing a training generative input comprising the training conditioning text input using the feedforward generative neural network in accordance with current values of the generative parameters to generate a training audio output having a plurality of non-overlapping time windows, the processing comprising:
      processing the training generative input using an alignment neural network to generate an aligned conditioning sequence comprising a respective feature representation at each of a plurality of first time steps, wherein each of the plurality of first time steps corresponds to a different one of the plurality of non-overlapping time windows in the training audio output, and wherein processing the training generative input using the alignment neural network comprises:
        processing the training generative input using a first subnetwork to generate an intermediate sequence having a respective intermediate element at each of a plurality of intermediate time steps;
        processing the intermediate sequence using a second subnetwork to generate, for each intermediate element, a length prediction characterizing a predicted length of time for the intermediate element, wherein the predicted length of time represents a time duration for which speech represented by the intermediate element will be spoken in the training audio output; and
        processing the respective length predictions for the intermediate elements to generate the aligned conditioning sequence by non-uniformly interpolating the intermediate sequence using the respective length predictions of the intermediate elements to generate the feature representations at each of the plurality of first time steps in the aligned conditioning sequence, wherein non-uniformly interpolating the intermediate sequence comprises, for each first time step:
          determining a respective weight value for each of the intermediate elements from the respective length predictions for the intermediate elements, each of which represents the time duration for which speech represented by the corresponding intermediate element will be spoken in the training audio output, and
          combining the intermediate elements using the respective weight values for the intermediate elements to generate the feature representation at the first time step; and
      processing the aligned conditioning sequence using a generator neural network to generate the training audio output;
    processing the training audio output using each of one or more discriminators, wherein each discriminator predicts whether the training audio output is a real audio example or a synthetic audio example;
    determining a final prediction using the respective predictions of the one or more discriminators; and
    determining an update to the current values of the generative parameters to increase a first error in the final prediction.
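The non-uniform interpolation recited above (weights derived from length predictions, then a weighted combination per first time step) can be sketched in numpy as Gaussian-weighted upsampling. The Gaussian form of the weights, the `sigma` temperature, and the function name are illustrative assumptions; the claim only requires that the weight values be determined from the length predictions.

```python
import numpy as np

def nonuniform_interpolate(intermediate, lengths, sigma=10.0):
    """Sketch of non-uniform interpolation of an intermediate sequence.

    intermediate: (N, D) array, one feature vector per intermediate time step.
    lengths:      (N,) array of predicted durations, in first time steps.
    Returns an aligned conditioning sequence of shape (T, D),
    where T = round(sum(lengths)).
    """
    ends = np.cumsum(lengths)            # predicted end position of each element
    centres = ends - lengths / 2.0       # predicted centre of each element
    T = int(round(ends[-1]))             # total number of first time steps
    t = np.arange(T, dtype=np.float64) + 0.5
    # Weight of element n at first time step t from the squared distance
    # between t and the element's predicted centre (softmax over elements).
    logits = -((t[:, None] - centres[None, :]) ** 2) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # normalise per time step
    # Combine the intermediate elements using the weight values.
    return weights @ intermediate
```

Because the weights at each first time step form a convex combination, each output feature vector is a blend of the intermediate elements whose predicted spans are nearest to that time step, which keeps the whole alignment differentiable.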
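The last three steps of the claim (discriminate, combine predictions into a final prediction, update the generative parameters to increase the error in that prediction) can be illustrated with a toy numpy sketch. The logistic discriminators, the averaging combination, the linear generator, the finite-difference gradient, and all names here are illustrative assumptions, not the claimed architecture.

```python
import numpy as np

def discriminate(audio, discs):
    """Toy discriminators: each is a (w, b) logistic model over a fixed-size
    audio vector, predicting the probability that the audio is real."""
    return [1.0 / (1.0 + np.exp(-(audio @ w + b))) for w, b in discs]

def generator_loss(theta, cond, discs):
    """Adversarial loss for a toy linear generator: audio = cond @ theta.
    The final prediction averages the discriminators' predictions; this loss
    is small when the synthetic audio is (wrongly) judged real."""
    audio = cond @ theta
    final = np.mean(discriminate(audio, discs))
    return -np.log(final + 1e-9)

def update_generator(theta, cond, discs, lr=0.01, eps=1e-5):
    """One finite-difference gradient step (autodiff stands in for this in
    practice). Decreasing this loss is the same as increasing the
    discriminators' error in the final prediction on synthetic audio."""
    grad = np.zeros_like(theta)
    for i in np.ndindex(theta.shape):
        bumped = theta.copy()
        bumped[i] += eps
        grad[i] = (generator_loss(bumped, cond, discs)
                   - generator_loss(theta, cond, discs)) / eps
    return theta - lr * grad
```

After one update, the averaged discriminator prediction on the generator's output moves toward "real", i.e. the first error in the final prediction increases, which is the training signal the claim recites.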