| CPC G06T 9/002 (2013.01) [G06N 3/045 (2023.01)] | 20 Claims |

|
1. A computer-implemented method for generating a video, the method comprising:
generating an initial output video including a plurality of frames, wherein each of the frames has a plurality of channels, each channel being a two-dimensional image and indexed by a respective channel index from a set of channel indices of the initial output video, and wherein, for each channel, each pixel in the channel is assigned a predetermined pixel value or is padded with a blank pixel;
identifying a partitioning of the initial output video into a set of channel slices that are indexed according to a particular slice order, wherein each channel slice is a down sampling of a channel stack from a set of channel stacks, and wherein each channel stack in the set corresponds to a respective channel index and is a stack of channels having the respective channel index according to time;
initializing, for each channel stack in the set of channel stacks, a set of fully-generated channel slices;
repeatedly performing the following operations according to the particular slice order:
processing, using an encoder neural network, a current output video comprising the current set of fully-generated channel slices of all channel stacks to generate an encoded conditioning channel slice, wherein the encoder neural network comprises a plurality of encoding self-attention layers, wherein each of the plurality of encoding self-attention layers is configured to receive as input a padded video that is a representation of the current output video and includes a set of channel stacks, divide the padded video into a plurality of video blocks, and apply a self-attention mechanism on each of the plurality of video blocks,
processing, using a decoder neural network, the encoded conditioning channel slice to generate a next fully-generated channel slice, and
adding the next fully-generated channel slice to the current set of fully-generated channel slices of the channel stack;
generating, for each of the channel indices, a respective fully-generated channel stack using the respective fully-generated channel slices; and
generating a fully-generated output video using the fully-generated channel stacks generated for the channel indices.
|
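
For illustration only, the slice partitioning and the slice-by-slice generation loop recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the strided subsampling scheme, the offset-then-channel slice order, and the names `slice_offsets`, `get_slice`, `toy_encoder`, `toy_decoder`, and `generate` are all hypothetical, and the two "networks" are trivial stand-ins for the encoder and decoder neural networks (no self-attention is modeled).

```python
import itertools
import numpy as np

def slice_offsets(strides):
    """Enumerate (t, h, w) subsampling offsets; this order is an assumed slice order."""
    st, sh, sw = strides
    return list(itertools.product(range(st), range(sh), range(sw)))

def get_slice(stack, offset, strides):
    """Return a strided view (a downsampling) of one channel stack of shape (T, H, W)."""
    (a, b, c), (st, sh, sw) = offset, strides
    return stack[a::st, b::sh, c::sw]

def toy_encoder(video):
    """Stand-in for the encoder network: summarizes the current output video."""
    return video.mean()

def toy_decoder(conditioning, shape):
    """Stand-in for the decoder network: produces the next slice from the conditioning."""
    return np.full(shape, conditioning + 1.0)

def generate(T, H, W, C, strides):
    # Initial output video: every pixel starts as a blank (zero) value.
    video = np.zeros((T, H, W, C))
    for offset in slice_offsets(strides):
        for c in range(C):
            # Condition on everything generated so far, across all channel stacks.
            cond = toy_encoder(video)
            # View into channel stack c; writing to it fills the slice in place.
            target = get_slice(video[..., c], offset, strides)
            target[...] = toy_decoder(cond, target.shape)
    return video

video = generate(T=4, H=4, W=4, C=3, strides=(2, 2, 2))
print(video.shape)
```

Because the slices of a channel stack are disjoint strided views that together cover every pixel, filling each slice once yields a complete channel stack, and the filled stacks together form the output video.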