US 12,469,186 B2
Systems and methods for generating multimodal data using a single-tower architecture with a data generation subsystem
Mostafa Dehghani, Amsterdam (NL); Phillip Lippe, Amsterdam (NL); Emiel Hoogeboom, Amsterdam (NL); and Jonathan Heek, Hilversum (NL)
Assigned to GDM Holding LLC, Mountain View, CA (US)
Filed by GDM Holding LLC, Mountain View, CA (US)
Filed on Apr. 29, 2025, as Appl. No. 19/193,787.
Claims priority of provisional application 63/640,140, filed on Apr. 29, 2024.
Prior Publication US 2025/0336101 A1, Oct. 30, 2025
Int. Cl. G06N 20/00 (2019.01); G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06T 11/00 (2006.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 30/19 (2022.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)

CPC G06T 11/00 (2013.01) [G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 30/19147 (2022.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)]

25 Claims

1. A computer-implemented method of generating multimodal data using a system comprising a token generation neural network, and an image generation subsystem comprising an image generation neural network, the method comprising:

receiving a prompt sequence that defines an input sequence of multimodal tokens, and processing the input sequence of multimodal tokens using the token generation neural network to generate an output sequence of multimodal tokens, wherein a multimodal token represents a data element of one of a plurality of modalities;

wherein generating the output sequence of multimodal tokens comprises autoregressively, for each successive position in the output sequence of multimodal tokens:

processing a combined sequence comprising the input sequence of multimodal tokens and a current output sequence of multimodal tokens, using the token generation neural network, to generate a next multimodal token for the output sequence of multimodal tokens, and appending the next multimodal token to the current output sequence of multimodal tokens;

the method further comprising, in response to the next multimodal token being a start-of-image token:

generating an image using the image generation subsystem conditioned on features representing the current output sequence of multimodal tokens obtained from the token generation neural network;

processing the image to convert pixels of the image into a sequence of image tokens, each image token comprising a block encoding of values of the pixels in a different region of the image that maps a set of values of the pixels to a respective image token; and

appending the sequence of image tokens to the current output sequence of multimodal tokens as the next multimodal tokens in the output sequence of multimodal tokens.
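
The control flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Every name below is hypothetical: `next_token_fn` stands in for the token generation neural network, `image_fn` for the image generation subsystem, and the sentinel ids `START_OF_IMAGE` and `END_OF_SEQUENCE` are arbitrary. Two simplifying assumptions are made. First, the image subsystem is conditioned on the raw token list rather than on features obtained from the network. Second, the block encoding is a toy mapping that quantizes the mean intensity of each non-overlapping block of a grayscale image into one of 16 token ids.

```python
from typing import Callable, List, Sequence

# Hypothetical sentinel ids and vocabulary layout (assumptions, not from the claim).
START_OF_IMAGE = -1
END_OF_SEQUENCE = -2
IMAGE_TOKEN_OFFSET = 1000  # image tokens occupy ids 1000..1015 in this sketch
LEVELS = 16                # number of quantization levels per block

Image = List[List[int]]    # grayscale pixel grid with values in 0..255


def tokenize_image(image: Image, block: int) -> List[int]:
    """Map each non-overlapping block x block region to one image token.

    Toy block encoding: the mean pixel value of the region is quantized
    into LEVELS buckets. A real system would likely use a learned encoder.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image dimensions must be multiples of the block size")
    tokens = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            values = [image[r][c]
                      for r in range(top, top + block)
                      for c in range(left, left + block)]
            mean = sum(values) // len(values)
            tokens.append(IMAGE_TOKEN_OFFSET + mean * LEVELS // 256)
    return tokens


def generate(prompt: Sequence[int],
             next_token_fn: Callable[[List[int]], int],
             image_fn: Callable[[List[int]], Image],
             block: int = 2,
             max_steps: int = 32) -> List[int]:
    """Autoregressively build the output sequence, splicing in image tokens."""
    output: List[int] = []
    for _ in range(max_steps):
        # Process the combined sequence (prompt + current output) to get the next token.
        token = next_token_fn(list(prompt) + output)
        output.append(token)
        if token == END_OF_SEQUENCE:
            break
        if token == START_OF_IMAGE:
            # Generate an image conditioned on the current output sequence
            # (raw tokens stand in for network features here), convert its
            # pixels to image tokens, and append them as the next tokens.
            image = image_fn(list(output))
            output.extend(tokenize_image(image, block))
    return output
```

Because the image tokens are appended to the same output sequence, the next call to `next_token_fn` sees them as ordinary context, which is what lets later tokens depend on the generated image.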