US 12,142,015 B2
Auto-regressive video generation neural networks
Oscar Carl Tackstrom, Stockholm (SE); Jakob D. Uszkoreit, Mountain View, CA (US); and Dirk Weissenborn, Berlin (DE)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 17/609,668
Filed by Google LLC, Mountain View, CA (US)
PCT Filed May 22, 2020, PCT No. PCT/US2020/034185
§ 371(c)(1), (2) Date Nov. 8, 2021,
PCT Pub. No. WO2020/237136, PCT Pub. Date Nov. 26, 2020.
Claims priority of provisional application 62/852,271, filed on May 23, 2019.
Prior Publication US 2022/0215594 A1, Jul. 7, 2022
Int. Cl. G06T 9/00 (2006.01); G06N 3/045 (2023.01)
CPC G06T 9/002 (2013.01) [G06N 3/045 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for generating a video, the method comprising:
generating an initial output video including a plurality of frames, wherein each of the frames has a plurality of channels, each channel being a two-dimensional image and indexed by a respective channel index from a set of channel indices of the initial output video, and wherein, for each channel, each pixel in the channel is assigned a predetermined pixel value or is padded with a blank pixel;
identifying a partitioning of the initial output video into a set of channel slices that are indexed according to a particular slice order, wherein each channel slice is a downsampling of a channel stack from a set of channel stacks, and wherein each channel stack in the set corresponds to a respective channel index and is a stack of channels having the respective channel index according to time;
initializing, for each channel stack in the set of channel stacks, a set of fully-generated channel slices;
repeatedly performing the following operations according to the particular slice order:
processing, using an encoder neural network, a current output video comprising the current set of fully-generated channel slices of all channel stacks to generate an encoded conditioning channel slice, wherein the encoder neural network comprises a plurality of encoding self-attention layers, wherein each of the plurality of encoding self-attention layers is configured to receive as input a padded video that is a representation of the current output video and includes a set of channel stacks, divide the padded video into a plurality of video blocks, and apply a self-attention mechanism on each of the plurality of video blocks,
processing, using a decoder neural network, the encoded conditioning channel slice to generate a next fully-generated channel slice, and
adding the next fully-generated channel slice to the current set of fully-generated channel slices of the channel stack;
generating, for each of the channel indices, a respective fully-generated channel stack using the respective fully-generated channel slices; and
generating a fully-generated output video using the fully-generated channel stacks generated for the channel indices.
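The claimed generation loop can be illustrated with a minimal, dependency-free sketch. This is not the patented implementation: the dimensions, the averaging-based encoder, the tanh-plus-noise decoder, and all function names (`zeros`, `block_self_attention`, `encode`, `decode`) are hypothetical stand-ins chosen only to mirror the claimed structure: a blank initial video, channel stacks partitioned into slices, blocked self-attention over a padded video, and slices generated in a fixed order, each conditioned on every fully-generated slice so far.

```python
import math
import random

random.seed(0)

# Hypothetical toy dimensions; none of these values come from the claim.
T, H, W = 2, 4, 4      # frames, height, width of each channel
CHANNELS = 2           # number of channel indices (e.g. luma/chroma)
NUM_SLICES = 3         # fully-generated slices produced per channel stack
BLOCK = 2              # side length of a self-attention video block

def zeros():
    """A blank channel stack: every pixel padded with the value 0.0."""
    return [[[0.0 for _ in range(W)] for _ in range(H)] for _ in range(T)]

def block_self_attention(x):
    """Toy blocked self-attention over a (T, H, W) volume: divide each
    frame grid into BLOCK x BLOCK blocks; within a block, each frame's
    flattened pixels act as query/key/value and attention runs over time."""
    out = zeros()
    for i in range(0, H, BLOCK):
        for j in range(0, W, BLOCK):
            # Flatten the block at each time step.
            patch = [
                [x[t][i + a][j + b] for a in range(BLOCK) for b in range(BLOCK)]
                for t in range(T)
            ]
            d = len(patch[0])
            for t in range(T):
                scores = [
                    sum(q * k for q, k in zip(patch[t], patch[s])) / math.sqrt(d)
                    for s in range(T)
                ]
                m = max(scores)
                wts = [math.exp(s - m) for s in scores]
                z = sum(wts)
                attended = [
                    sum(wts[s] * patch[s][p] for s in range(T)) / z
                    for p in range(d)
                ]
                for a in range(BLOCK):
                    for b in range(BLOCK):
                        out[t][i + a][j + b] = attended[a * BLOCK + b]
    return out

def encode(slices):
    """Stand-in encoder: average all fully-generated slices into one
    padded video and apply blocked self-attention to it."""
    n = len(slices)
    avg = [[[sum(s[t][r][c] for s in slices) / n for c in range(W)]
            for r in range(H)] for t in range(T)]
    return block_self_attention(avg)

def decode(cond):
    """Stand-in decoder: produce the next channel slice from the encoded
    conditioning slice (a trivial elementwise transform plus noise)."""
    return [[[math.tanh(cond[t][r][c]) + random.uniform(-0.1, 0.1)
              for c in range(W)] for r in range(H)] for t in range(T)]

# 1. Initial output video: one blank channel stack per channel index.
generated = {ch: [zeros()] for ch in range(CHANNELS)}

# 2. Generate slices in a fixed slice order, each conditioned on every
#    fully-generated slice of every channel stack so far.
for _ in range(NUM_SLICES):
    for ch in range(CHANNELS):
        all_slices = [s for stack in generated.values() for s in stack]
        generated[ch].append(decode(encode(all_slices)))

# 3. Assemble the fully-generated output video from the channel stacks
#    (here, by summarizing each stack back into a single channel).
output = {ch: encode(generated[ch]) for ch in range(CHANNELS)}
```

The sketch keeps the claim's control flow (initialize blank stacks, iterate over the slice order, condition each new slice on the current set of fully-generated slices across all stacks, then assemble the output video) while replacing the learned encoder and decoder networks with trivial placeholders.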