CPC G06V 20/70 (2022.01) [G06F 40/10 (2020.01); G06F 40/126 (2020.01); G06F 40/284 (2020.01); G06F 40/35 (2020.01); G06F 40/40 (2020.01); G06N 20/00 (2019.01); G06T 9/00 (2013.01); G06V 10/74 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01)]

20 Claims

12. A system for pre-training a multimodal framework for vision-language tasks, the system comprising:
a communication interface receiving an image and a text accompanying the image;
a memory storing an image encoder, a query transformer, a pretrained language model, and a plurality of processor-executable instructions; and
one or more processors executing the instructions to perform operations including:
encoding, by the image encoder, the image into an image representation;
transforming, by the query transformer, the image representation and a set of queries into a transformed representation;
generating, by the query transformer, a text representation based at least in part on the text;
training, and thereby updating, the query transformer according to one or more vision-language training objectives computed based on the transformed representation and the text representation while keeping the image encoder frozen;
generating, by the pretrained language model, a decoded output text based on an output representation from the updated query transformer;
computing a loss based on the decoded output text and the text accompanying the image; and
training the query transformer based on the loss while keeping the image encoder and the pretrained language model frozen, wherein the pretrained language model includes a text decoder, and wherein the generating, by the pretrained language model, the decoded output text based on the output representation from the updated query transformer comprises:
projecting, via a fully connected layer, the output representation to a same dimension as that of the pretrained language model; and
generating, by the text decoder, the decoded output text based on the projected output representation.
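The following is a minimal PyTorch sketch of the first training stage recited in claim 12: the image encoder stays frozen while the query transformer is trained against a vision-language objective computed from the transformed representation and the text representation. All names and choices here are illustrative assumptions, not the claimed implementation: the `QueryTransformer` module, its dimensions, the stand-in patch-embedding image encoder, the vocabulary size, and the use of an image-text contrastive loss as the "one or more vision-language training objectives".

```python
# Sketch of stage 1: frozen image encoder, trainable query transformer.
# All module shapes/names are assumptions; the claim fixes no architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryTransformer(nn.Module):
    """Learned queries cross-attend to frozen image features; a small text
    branch produces the text representation (both in skeletal form)."""
    def __init__(self, dim=256, num_queries=32, vocab=30522):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.text_embed = nn.Embedding(vocab, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, image_repr, text_ids):
        b = image_repr.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Transform the image representation and the set of queries together.
        transformed, _ = self.cross_attn(q, image_repr, image_repr)
        text_repr = self.text_encoder(self.text_embed(text_ids))
        return transformed, text_repr

def contrastive_loss(transformed, text_repr):
    """One possible vision-language objective (an assumption): image-text
    contrastive loss over pooled representations."""
    img = F.normalize(transformed.mean(dim=1), dim=-1)
    txt = F.normalize(text_repr.mean(dim=1), dim=-1)
    logits = img @ txt.t() / 0.07
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Stand-in for a ViT-style image encoder: 16x16 patch embedding.
image_encoder = nn.Sequential(nn.Conv2d(3, 256, 16, 16), nn.Flatten(2))
for p in image_encoder.parameters():
    p.requires_grad = False  # the image encoder is kept frozen

qformer = QueryTransformer()
opt = torch.optim.AdamW(qformer.parameters(), lr=1e-4)

image = torch.randn(4, 3, 224, 224)          # batch of images
text_ids = torch.randint(0, 30522, (4, 16))  # tokenized accompanying text

with torch.no_grad():
    image_repr = image_encoder(image).transpose(1, 2)  # (B, patches, dim)
transformed, text_repr = qformer(image_repr, text_ids)
loss = contrastive_loss(transformed, text_repr)
loss.backward()   # gradients flow only into the query transformer
opt.step()
```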
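A companion sketch of the second stage recited in the claim: the output representation from the updated query transformer is projected through a fully connected layer to the pretrained language model's dimension, a frozen text decoder generates the decoded output text conditioned on it, and the loss against the accompanying text trains only the non-frozen components. The toy decoder, the 768-dimension choice, and the teacher-forced language-modeling loss are assumptions for illustration.

```python
# Sketch of stage 2: FC projection into a frozen language-model text decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, LM_DIM, Q_DIM = 30522, 768, 256  # assumed sizes

class FrozenTextDecoder(nn.Module):
    """Stand-in for the pretrained language model's text decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, LM_DIM)
        layer = nn.TransformerDecoderLayer(LM_DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(LM_DIM, VOCAB)

    def forward(self, text_ids, prefix):
        # Decoding is conditioned on the projected output representation,
        # used here as the cross-attention memory, with a causal mask.
        tgt = self.embed(text_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory=prefix, tgt_mask=mask)
        return self.lm_head(h)

projector = nn.Linear(Q_DIM, LM_DIM)  # the claimed fully connected layer
lm = FrozenTextDecoder()
for p in lm.parameters():
    p.requires_grad = False           # the pretrained language model is kept frozen

# `output_repr` would come from the updated query transformer of stage 1;
# random tensors stand in for it and for the tokenized accompanying text.
output_repr = torch.randn(4, 32, Q_DIM)
text_ids = torch.randint(0, VOCAB, (4, 16))

opt = torch.optim.AdamW(projector.parameters(), lr=1e-4)
prefix = projector(output_repr)                    # match the LM's dimension
logits = lm(text_ids[:, :-1], prefix)              # teacher-forced decoding
loss = F.cross_entropy(logits.reshape(-1, VOCAB),  # loss vs. accompanying text
                       text_ids[:, 1:].reshape(-1))
loss.backward()  # gradients reach only the projector here; in the full system
opt.step()       # they would also update the query transformer
```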