US 11,908,180 B1
Generating videos using sequences of generative neural networks
Jonathan Ho, New York, NY (US); William Chan, Toronto (CA); Chitwan Saharia, Toronto (CA); Jay Ha Whang, Austin, TX (US); and Tim Salimans, Utrecht (NL)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 24, 2023, as Appl. No. 18/126,281.
Int. Cl. G06V 10/82 (2022.01); G06T 3/40 (2006.01)
CPC G06V 10/82 (2022.01) [G06T 3/4053 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method performed by one or more computers, the method comprising:
    receiving a text prompt describing a scene;
    processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and
    processing the contextual embedding using a sequence of generative neural networks to generate a final video depicting the scene, wherein the sequence of generative neural networks comprises:
        an initial generative neural network configured to:
            receive the contextual embedding; and
            process the contextual embedding to generate, as output, an initial output video having: (i) an initial spatial resolution, and (ii) an initial temporal resolution; and
        one or more subsequent generative neural networks each configured to:
            receive a respective input comprising an input video generated as output by a preceding generative neural network in the sequence; and
            process the respective input to generate, as output, a respective output video having at least one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video,
    wherein the generative neural networks have been jointly trained on training data comprising a plurality of training examples that each include: (i) respective input text describing a respective scene, and (ii) a corresponding target video depicting the respective scene,
    wherein the training examples include image-based training examples,
    wherein the respective target video of each image-based training example comprises a respective plurality of individual images each depicting the respective scene described by the corresponding input text, and
    wherein jointly training the generative neural networks on the image-based training examples comprised masking out any temporal self-attention and temporal convolution implemented by the generative neural networks.
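The first two steps of the claim, encoding a text prompt into a contextual embedding, can be illustrated with a minimal Python sketch. The hashing tokenizer and mean-pooled embedding below are purely hypothetical stand-ins; the claim does not specify the text encoder's architecture, and a real contextual encoder would produce per-token representations that depend on surrounding context rather than a pooled toy embedding (large pretrained language-model encoders are typical for this role in practice).

import torch

class ToyTextEncoder(torch.nn.Module):
    """Hypothetical stand-in for the claimed text encoder neural network."""
    def __init__(self, vocab_size=1000, embed_dim=64):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed = torch.nn.Embedding(vocab_size, embed_dim)

    def forward(self, prompt):
        # Hash whitespace-separated tokens into a fixed vocabulary, embed
        # them, and mean-pool into one embedding for the whole prompt.
        ids = torch.tensor([hash(tok) % self.vocab_size
                            for tok in prompt.lower().split()])
        return self.embed(ids).mean(dim=0, keepdim=True)  # (1, embed_dim)

encoder = ToyTextEncoder()
embedding = encoder("a teddy bear washing dishes")  # a (1, 64) prompt embedding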
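The sequence of generative neural networks, an initial generator producing a low-resolution, low-frame-rate video followed by stages that raise spatial and/or temporal resolution, can be sketched in PyTorch as below. All module names, tensor shapes, and the interpolation-based upsamplers are assumptions made for illustration; in the claim each stage is a trained generative neural network, not a fixed interpolation.

import torch
import torch.nn.functional as F

class InitialGenerator(torch.nn.Module):
    """Stand-in for the initial generative network: maps a contextual text
    embedding to a low-resolution, low-frame-rate video."""
    def __init__(self, embed_dim=64, frames=8, height=16, width=16):
        super().__init__()
        self.frames, self.height, self.width = frames, height, width
        self.proj = torch.nn.Linear(embed_dim, 3 * frames * height * width)

    def forward(self, embedding):  # embedding: (batch, embed_dim)
        video = self.proj(embedding)
        return video.view(-1, 3, self.frames, self.height, self.width)

class SpatialUpsampler(torch.nn.Module):
    """Stand-in for a subsequent network that raises spatial resolution."""
    def __init__(self, scale=2):
        super().__init__()
        self.scale = scale

    def forward(self, video):  # video: (batch, 3, T, H, W)
        return F.interpolate(video, scale_factor=(1, self.scale, self.scale),
                             mode="trilinear", align_corners=False)

class TemporalUpsampler(torch.nn.Module):
    """Stand-in for a subsequent network that raises temporal resolution."""
    def __init__(self, scale=2):
        super().__init__()
        self.scale = scale

    def forward(self, video):
        return F.interpolate(video, scale_factor=(self.scale, 1, 1),
                             mode="trilinear", align_corners=False)

# The cascade structure mirrors the claim: each subsequent stage consumes
# the preceding stage's output video and raises its spatial and/or
# temporal resolution.
embedding = torch.randn(1, 64)  # contextual embedding of the text prompt
stages = [InitialGenerator(), TemporalUpsampler(), SpatialUpsampler()]
video = stages[0](embedding)
for stage in stages[1:]:
    video = stage(video)
print(video.shape)  # torch.Size([1, 3, 16, 32, 32])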
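The final wherein clause, masking out temporal self-attention and temporal convolution when training on image-based examples, can be sketched as a switchable bypass of a temporal-mixing layer, so that each individual image packed into the target "video" is processed independently of the others. The bypass mechanism shown is an assumption; the claim states only that the temporal operations are masked out during such training.

import torch

class MaskableTemporalConv(torch.nn.Module):
    """Temporal convolution that can be masked out, as an illustrative
    analogue of the claimed masking during image-based training."""
    def __init__(self, channels=3, kernel_size=3):
        super().__init__()
        # Convolution over the time axis only (H and W are folded into
        # the batch dimension before the 1-D convolution is applied).
        self.conv = torch.nn.Conv1d(channels, channels, kernel_size,
                                    padding=kernel_size // 2)

    def forward(self, video, mask_temporal):
        # video: (batch, channels, T, H, W)
        if mask_temporal:
            # Image-based example: skip temporal mixing entirely, so each
            # frame (image) is processed independently of the others.
            return video
        b, c, t, h, w = video.shape
        x = video.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        x = self.conv(x)
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

layer = MaskableTemporalConv()
images_as_video = torch.randn(2, 3, 4, 8, 8)  # 4 unrelated images per example
out = layer(images_as_video, mask_temporal=True)  # temporal conv masked out
assert torch.equal(out, images_as_video)  # no information flows across frames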