CPC G06V 20/70 (2022.01) [G06F 40/10 (2020.01); G06F 40/126 (2020.01); G06F 40/284 (2020.01); G06F 40/35 (2020.01); G06F 40/40 (2020.01); G06N 20/00 (2019.01); G06T 9/00 (2013.01); G06V 10/74 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01)]

20 Claims

12. A system for pre-training a multimodal framework for vision-language tasks, the system comprising:
a communication interface receiving an image and a text accompanying the image;
a memory storing an image encoder, a query transformer, a pretrained language model, and a plurality of processor-executable instructions; and
one or more processors executing the instructions to perform operations including:
encoding, by the image encoder, the image into an image representation;
transforming, by the query transformer, the image representation and a set of queries into a transformed representation;
generating, by the query transformer, a text representation based at least in part on the text;
training, and thereby updating, the query transformer according to one or more vision-language training objectives computed based on the transformed representation and the text representation while keeping the image encoder frozen;
generating, by the pretrained language model, a decoded output text based on an output representation from the updated query transformer;
computing a loss based on the decoded output text and the text accompanying the image; and
training the query transformer based on the loss while keeping the image encoder and the pretrained language model frozen, wherein the pretrained language model includes a text decoder, and wherein the generating, by the pretrained language model, the decoded output text based on the output representation from the updated query transformer comprises:
projecting, via a fully connected layer, the output representation to a same dimension as that of the pretrained language model; and
generating, by the text decoder, the decoded output text based on the projected output representation.
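The following is a minimal PyTorch sketch of the first training stage recited in claim 12: the image encoder stays frozen while the query transformer is trained against a vision-language objective computed from the transformed representation and the text representation. All names and choices here are illustrative assumptions, not the claimed implementation: the `QueryTransformer` module, its dimensions, the stand-in patch-embedding image encoder, the vocabulary size, and the use of an image-text contrastive loss as the "one or more vision-language training objectives".

```python
# Sketch of stage 1: frozen image encoder, trainable query transformer.
# All module shapes/names are assumptions; the claim fixes no architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryTransformer(nn.Module):
    """Learned queries cross-attend to frozen image features; a small text
    branch produces the text representation (both in skeletal form)."""
    def __init__(self, dim=256, num_queries=32, vocab=30522):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.text_embed = nn.Embedding(vocab, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, image_repr, text_ids):
        b = image_repr.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Transform the image representation and the set of queries together.
        transformed, _ = self.cross_attn(q, image_repr, image_repr)
        text_repr = self.text_encoder(self.text_embed(text_ids))
        return transformed, text_repr

def contrastive_loss(transformed, text_repr):
    """One possible vision-language objective (an assumption): image-text
    contrastive loss over pooled representations."""
    img = F.normalize(transformed.mean(dim=1), dim=-1)
    txt = F.normalize(text_repr.mean(dim=1), dim=-1)
    logits = img @ txt.t() / 0.07
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Stand-in for a ViT-style image encoder: 16x16 patch embedding.
image_encoder = nn.Sequential(nn.Conv2d(3, 256, 16, 16), nn.Flatten(2))
for p in image_encoder.parameters():
    p.requires_grad = False  # the image encoder is kept frozen

qformer = QueryTransformer()
opt = torch.optim.AdamW(qformer.parameters(), lr=1e-4)

image = torch.randn(4, 3, 224, 224)          # batch of images
text_ids = torch.randint(0, 30522, (4, 16))  # tokenized accompanying text

with torch.no_grad():
    image_repr = image_encoder(image).transpose(1, 2)  # (B, patches, dim)
transformed, text_repr = qformer(image_repr, text_ids)
loss = contrastive_loss(transformed, text_repr)
loss.backward()   # gradients flow only into the query transformer
opt.step()
```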
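A companion sketch of the second stage recited in the claim: the output representation from the updated query transformer is projected through a fully connected layer to the pretrained language model's dimension, a frozen text decoder generates the decoded output text conditioned on it, and the loss against the accompanying text trains only the non-frozen components. The toy decoder, the 768-dimension choice, and the teacher-forced language-modeling loss are assumptions for illustration.

```python
# Sketch of stage 2: FC projection into a frozen language-model text decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, LM_DIM, Q_DIM = 30522, 768, 256  # assumed sizes

class FrozenTextDecoder(nn.Module):
    """Stand-in for the pretrained language model's text decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, LM_DIM)
        layer = nn.TransformerDecoderLayer(LM_DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(LM_DIM, VOCAB)

    def forward(self, text_ids, prefix):
        # Decoding is conditioned on the projected output representation,
        # used here as the cross-attention memory, with a causal mask.
        tgt = self.embed(text_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory=prefix, tgt_mask=mask)
        return self.lm_head(h)

projector = nn.Linear(Q_DIM, LM_DIM)  # the claimed fully connected layer
lm = FrozenTextDecoder()
for p in lm.parameters():
    p.requires_grad = False           # the pretrained language model is kept frozen

# `output_repr` would come from the updated query transformer of stage 1;
# random tensors stand in for it and for the tokenized accompanying text.
output_repr = torch.randn(4, 32, Q_DIM)
text_ids = torch.randint(0, VOCAB, (4, 16))

opt = torch.optim.AdamW(projector.parameters(), lr=1e-4)
prefix = projector(output_repr)                    # match the LM's dimension
logits = lm(text_ids[:, :-1], prefix)              # teacher-forced decoding
loss = F.cross_entropy(logits.reshape(-1, VOCAB),  # loss vs. accompanying text
                       text_ids[:, 1:].reshape(-1))
loss.backward()  # gradients reach only the projector here; in the full system
opt.step()       # they would also update the query transformer
```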