CPC G06T 11/60 (2013.01) [G06T 5/50 (2013.01); G06T 5/60 (2024.01); G06T 5/70 (2024.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30196 (2013.01)]   16 Claims

1. A method comprising:
receiving an image including an object;
receiving a text including at least one instruction for transforming the object;
extracting, from the image, a foreground image corresponding to the object;
determining, based on the image, a mask corresponding to the object;
encoding the foreground image into an image latent;
extracting, from the image, using a third neural network, at least one feature associated with the object, wherein the object includes a person and the at least one feature includes one or more of the following: an ethnicity of the person, a gender of the person, an age of the person, and an orientation of a body of the person with respect to a plane of the image;
updating the text with the at least one feature to obtain an updated text;
encoding the updated text into a text embedding;
randomly generating a first noise for the image latent;
combining the first noise and the image latent to obtain a noisy image latent;
providing the noisy image latent and the text embedding to a first neural network to generate a second noise;
removing the second noise from the noisy image latent to obtain a denoised image latent;
decoding, using a second neural network, the denoised image latent into an output image; and
generating a result image based on the mask, the output image, and the image.
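The sequence recited in claim 1 corresponds to a mask-guided, latent-diffusion image edit: segment the object, enrich the text instruction with attributes of the person, add noise to the encoded foreground, denoise it conditioned on the text, decode, and composite. The sketch below is illustrative only and is not the patented implementation; the model handles (segmenter, attribute_extractor, text_encoder, unet, vae), the alpha schedule, and the single denoising step are assumptions introduced here for clarity.

```python
import torch

def edit_object(image, instruction, segmenter, attribute_extractor,
                text_encoder, unet, vae, alpha=0.7):
    """Illustrative sketch of the claimed pipeline; all model handles and the
    one-step noise schedule are assumptions, not the patented implementation."""
    # Determine a mask for the object and extract the foreground image.
    mask = segmenter(image)                       # 1.0 inside the object, 0.0 elsewhere
    foreground = image * mask

    # "Third neural network": describe the person so the instruction stays consistent.
    attributes = attribute_extractor(foreground)  # e.g. ["woman", "adult", "facing left"]
    updated_text = instruction + ", " + ", ".join(attributes)
    text_embedding = text_encoder(updated_text)

    # Encode the foreground into an image latent and mix in randomly generated first noise.
    image_latent = vae.encode(foreground)
    first_noise = torch.randn_like(image_latent)
    noisy_latent = alpha**0.5 * image_latent + (1 - alpha)**0.5 * first_noise

    # "First neural network" (e.g. a diffusion U-Net) predicts a second noise,
    # which is removed to recover a denoised image latent.
    second_noise = unet(noisy_latent, text_embedding)
    denoised_latent = (noisy_latent - (1 - alpha)**0.5 * second_noise) / alpha**0.5

    # "Second neural network" (e.g. a VAE decoder) turns the latent back into pixels;
    # the mask composites the edited object over the unchanged background.
    output_image = vae.decode(denoised_latent)
    return mask * output_image + (1 - mask) * image
```

The final line mirrors the claim's last step: only pixels inside the mask are taken from the diffusion output, so the background of the original image passes through the edit untouched.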