| CPC G06T 11/00 (2013.01) [G06F 40/40 (2020.01)] | 20 Claims |

|
1. A system comprising:
one or more storage media storing instructions; and
one or more processors configured to execute the instructions to cause the system to:
receive a prompt describing a desired characteristic of an image;
generate, using a set of encoding models, a prompt encoding based on the prompt;
generate, using a first transformer block of a diffusion transformer model, a first prompt embedding and a first image embedding based on the prompt encoding and a noise input;
generate, using a second transformer block of the diffusion transformer model, a second image embedding based on the first image embedding and the first prompt embedding; and
generate the image based on the second image embedding.
|
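The data flow recited in claim 1 can be illustrated with a toy sketch. This is not the claimed implementation: it assumes NumPy, and every name in it (`encode_prompt`, `JointBlock`, `generate_image`, the width `D`, the hash-style toy encoders, and the mean-pooling "decoder") is hypothetical. Each block is modeled here as a single joint self-attention layer over the concatenated prompt and image tokens, which is only one possible reading of "transformer block".

```python
import numpy as np

D = 16  # hypothetical shared embedding width


def encode_prompt(prompt, n_models=2, dim=D):
    """Stand-in for 'a set of encoding models': each toy encoder maps a
    token's bytes to a vector; per-model outputs are concatenated."""
    tokens = prompt.split()
    encodings = []
    for m in range(n_models):
        table = np.random.default_rng(m).standard_normal((256, dim // n_models))
        enc = np.stack(
            [table[list(tok.encode("utf-8"))].mean(axis=0) for tok in tokens]
        )
        encodings.append(enc)
    return np.concatenate(encodings, axis=-1)  # shape: (n_tokens, dim)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class JointBlock:
    """Toy transformer block: one joint self-attention layer with a residual
    connection, returning updated prompt and image embeddings separately."""

    def __init__(self, dim, seed):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.wq, self.wk, self.wv = (
            rng.standard_normal((dim, dim)) * scale for _ in range(3)
        )

    def __call__(self, prompt_emb, image_emb):
        x = np.concatenate([prompt_emb, image_emb], axis=0)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        x = x + attn @ v
        n = prompt_emb.shape[0]
        return x[:n], x[n:]


def generate_image(prompt, side=4, seed=0):
    # Prompt encoding from the set of (toy) encoding models.
    prompt_encoding = encode_prompt(prompt)
    # Noise input: one token per image patch.
    noise = np.random.default_rng(seed).standard_normal((side * side, D))
    block1, block2 = JointBlock(D, seed=1), JointBlock(D, seed=2)
    # First block: first prompt embedding and first image embedding.
    prompt_emb1, image_emb1 = block1(prompt_encoding, noise)
    # Second block: second image embedding from the first block's outputs.
    _, image_emb2 = block2(prompt_emb1, image_emb1)
    # Toy decode: collapse each patch token to a single pixel value.
    return image_emb2.mean(axis=-1).reshape(side, side)


image = generate_image("a red fox in snow")
```

The sketch keeps the prompt and image token streams separate between blocks so that the second block consumes exactly the two outputs named in the claim. A practical diffusion transformer would add learned weights, timestep conditioning, and iterative denoising, none of which is shown here.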