US 12,192,547 B2
High-resolution video generation using image diffusion models
Karsten Julian Kreis, Vancouver (CA); Robin Rombach, Heidelberg (DE); Andreas Blattmann, Waldkirch (DE); Seung Wook Kim, Toronto (CA); Huan Ling, Toronto (CA); Sanja Fidler, Toronto (CA); and Tim Dockhorn, Waterloo (CA)
Assigned to NVIDIA Corporation, Santa Clara, CA (US)
Filed by NVIDIA Corporation, Santa Clara, CA (US)
Filed on Mar. 10, 2023, as Appl. No. 18/181,729.
Claims priority of provisional application 63/426,037, filed on Nov. 16, 2022.
Prior Publication US 2024/0171788 A1, May 23, 2024
Int. Cl. H04N 21/2343 (2011.01); G06T 9/00 (2006.01); G06V 10/24 (2022.01); G06V 10/25 (2022.01); G06V 10/82 (2022.01); H04N 7/01 (2006.01)
CPC H04N 21/234363 (2013.01) [G06T 9/00 (2013.01); G06V 10/24 (2022.01); G06V 10/25 (2022.01); G06V 10/82 (2022.01); H04N 7/0117 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A processor, comprising:
one or more circuits to:
align a plurality of images into frames of a first video using a neural network model comprising a latent diffusion model (LDM), wherein the first video has a first spatial resolution, and the LDM comprises:
an encoder to map an input from an image space to a latent space; and
a decoder to map a latent encoding from the latent space to the image space; and
generate a second video having a second spatial resolution by up-sampling the first video using an up-sampler neural network model, wherein the second spatial resolution is higher than the first spatial resolution, wherein the decoder is updated according to one or more temporal incoherencies in mapping the latent encoding from the latent space to the image space.
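
The following is a minimal sketch, in PyTorch-style Python, of the two-stage pipeline recited in claim 1: frames are encoded into a latent space, a temporally aware decoder maps latents back to image space to form the first video, and the result is up-sampled to a second, higher spatial resolution. All module names (FrameEncoder, TemporalDecoder, generate_high_res_video), layer choices, shapes, and the use of bilinear interpolation in place of the up-sampler neural network model are illustrative assumptions, not the patented implementation.

# Illustrative sketch only; names, layers, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """Maps images from image space to a lower-dimensional latent space."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, latent_ch, 3, stride=2, padding=1),
        )

    def forward(self, x):            # x: (B*T, 3, H, W)
        return self.net(x)           # -> (B*T, latent_ch, H/4, W/4)


class TemporalDecoder(nn.Module):
    """Maps latents back to image space; the 3D convolution mixes information
    across frames so the decoder can be updated (fine-tuned) to reduce
    temporal incoherencies such as flicker between decoded frames."""
    def __init__(self, latent_ch=4, out_ch=3):
        super().__init__()
        self.temporal = nn.Conv3d(latent_ch, latent_ch, (3, 1, 1), padding=(1, 0, 0))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, z, num_frames):
        b_t, c, h, w = z.shape
        # Reshape to (B, C, T, H, W) so the temporal convolution sees all frames.
        z = z.view(-1, num_frames, c, h, w).permute(0, 2, 1, 3, 4)
        z = self.temporal(z).permute(0, 2, 1, 3, 4).reshape(b_t, c, h, w)
        return self.net(z)


def generate_high_res_video(latents, decoder, num_frames, scale=4):
    """Decode temporally aligned latents into a first (low-resolution) video,
    then up-sample it to a second, higher spatial resolution. Bilinear
    interpolation stands in for the up-sampler neural network model here."""
    first_video = decoder(latents, num_frames)          # first spatial resolution
    second_video = F.interpolate(first_video, scale_factor=scale,
                                 mode="bilinear", align_corners=False)
    return first_video, second_video


if __name__ == "__main__":
    frames, encoder, decoder = 8, FrameEncoder(), TemporalDecoder()
    images = torch.randn(frames, 3, 64, 64)    # a plurality of images
    z = encoder(images)                        # image space -> latent space
    low, high = generate_high_res_video(z, decoder, num_frames=frames)
    print(low.shape, high.shape)               # (8, 3, 64, 64) (8, 3, 256, 256)

In this sketch the temporal mixing lives inside the decoder, mirroring the claim's requirement that the decoder be updated according to temporal incoherencies observed when mapping latents back to image space; the spatial up-sampling step is kept separate so the second video's resolution is strictly higher than the first's.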