CPC G06T 11/60 (2013.01) [G06T 5/50 (2013.01); G06T 5/60 (2024.01); G06T 5/70 (2024.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30196 (2013.01)]   16 Claims

1. A method comprising:
receiving an image including an object;
receiving a text including at least one instruction for transforming the object;
extracting, from the image, a foreground image corresponding to the object;
determining, based on the image, a mask corresponding to the object;
encoding the foreground image into an image latent;
extracting, from the image, using a third neural network, at least one feature associated with the object, wherein the object includes a person and the at least one feature includes one or more of the following: an ethnicity of the person, a gender of the person, an age of the person, and an orientation of a body of the person with respect to a plane of the image;
updating the text with the at least one feature to obtain an updated text;
encoding the updated text into a text embedding;
randomly generating a first noise for the image latent;
combining the first noise and the image latent to obtain a noisy image latent;
providing the noisy image latent and the text embedding to a first neural network to generate a second noise;
removing the second noise from the noisy image latent to obtain a denoised image latent;
decoding, using a second neural network, the denoised image latent into an output image; and
generating a result image based on the mask, the output image, and the image.
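The sequence recited in claim 1 corresponds to a mask-guided, latent-diffusion image edit: segment the object, enrich the text instruction with attributes of the person, add noise to the encoded foreground, denoise it conditioned on the text, decode, and composite. The sketch below is illustrative only and is not the patented implementation; the model handles (segmenter, attribute_extractor, text_encoder, unet, vae), the alpha schedule, and the single denoising step are assumptions introduced here for clarity.

```python
import torch

def edit_object(image, instruction, segmenter, attribute_extractor,
                text_encoder, unet, vae, alpha=0.7):
    """Illustrative sketch of the claimed pipeline; all model handles and the
    one-step noise schedule are assumptions, not the patented implementation."""
    # Determine a mask for the object and extract the foreground image.
    mask = segmenter(image)                       # 1.0 inside the object, 0.0 elsewhere
    foreground = image * mask

    # "Third neural network": describe the person so the instruction stays consistent.
    attributes = attribute_extractor(foreground)  # e.g. ["woman", "adult", "facing left"]
    updated_text = instruction + ", " + ", ".join(attributes)
    text_embedding = text_encoder(updated_text)

    # Encode the foreground into an image latent and mix in randomly generated first noise.
    image_latent = vae.encode(foreground)
    first_noise = torch.randn_like(image_latent)
    noisy_latent = alpha**0.5 * image_latent + (1 - alpha)**0.5 * first_noise

    # "First neural network" (e.g. a diffusion U-Net) predicts a second noise,
    # which is removed to recover a denoised image latent.
    second_noise = unet(noisy_latent, text_embedding)
    denoised_latent = (noisy_latent - (1 - alpha)**0.5 * second_noise) / alpha**0.5

    # "Second neural network" (e.g. a VAE decoder) turns the latent back into pixels;
    # the mask composites the edited object over the unchanged background.
    output_image = vae.decode(denoised_latent)
    return mask * output_image + (1 - mask) * image
```

The final line mirrors the claim's last step: only pixels inside the mask are taken from the diffusion output, so the background of the original image passes through the edit untouched.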