CPC G06V 10/776 (2022.01) [G06F 16/56 (2019.01); G06F 16/5846 (2019.01); G06F 40/126 (2020.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06V 10/761 (2022.01); G06V 10/806 (2022.01)] | 20 Claims |
1. A system for vision-language distribution alignment, the system comprising:
a data interface receiving a first batch of image samples and a second batch of text samples;
a memory storing a plurality of processor-executable instructions, an image encoder for encoding the first batch of image samples into a first plurality of image feature representations stored at a first feature queue, and
a text encoder for encoding the second batch of text samples into a second plurality of text feature representations stored at a second feature queue; and
a processor executing the plurality of processor-executable instructions to perform operations comprising:
computing an image-to-image similarity between at least one image feature representation and the first plurality of image feature representations in the first feature queue;
computing a text-to-text similarity between at least one text feature representation and the second plurality of text feature representations in the second feature queue;
computing a cross-modal alignment loss based on the image-to-image similarity and the text-to-text similarity; and
updating the image encoder and the text encoder based at least in part on the cross-modal alignment loss.
|
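The operations recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how the similarities are computed or what form the loss takes, so everything below is an assumption made for illustration: cosine similarity, a softmax with a hypothetical `temperature` parameter, a symmetric KL divergence between the image-to-image and text-to-text similarity distributions, and feature queues whose k-th entries come from the same image-text pair. The function names (`enqueue`, `similarity_distribution`, `cross_modal_alignment_loss`) are hypothetical and do not come from the patent.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def enqueue(queue, feats, max_size):
    # FIFO feature queue (assumed behavior): append new features, keep the newest max_size rows.
    return np.concatenate([queue, feats], axis=0)[-max_size:]


def similarity_distribution(feats, queue, temperature=0.1):
    # Intra-modal similarity of each feature to every queue entry,
    # turned into a probability distribution over the queue.
    sims = l2_normalize(feats) @ l2_normalize(queue).T
    return softmax(sims / temperature)


def cross_modal_alignment_loss(img_feats, txt_feats, img_queue, txt_queue,
                               temperature=0.1, eps=1e-12):
    # Assumed loss: symmetric KL divergence between the image-to-image and
    # text-to-text similarity distributions. Assumes row i of img_feats pairs
    # with row i of txt_feats, and queue entry k of one modality pairs with
    # queue entry k of the other.
    p_img = similarity_distribution(img_feats, img_queue, temperature)
    p_txt = similarity_distribution(txt_feats, txt_queue, temperature)
    log_ratio = np.log(p_img + eps) - np.log(p_txt + eps)
    kl_img_txt = np.sum(p_img * log_ratio, axis=-1)
    kl_txt_img = np.sum(p_txt * -log_ratio, axis=-1)
    return float(np.mean(0.5 * (kl_img_txt + kl_txt_img)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs; a real system would produce these
    # with the image encoder and the text encoder.
    img_queue = rng.normal(size=(16, 8))
    txt_queue = rng.normal(size=(16, 8))
    img_batch = rng.normal(size=(4, 8))
    txt_batch = rng.normal(size=(4, 8))

    loss = cross_modal_alignment_loss(img_batch, txt_batch, img_queue, txt_queue)
    print("alignment loss:", loss)

    # After the loss is computed, the current batch is pushed into the queues.
    img_queue = enqueue(img_queue, img_batch, max_size=16)
    txt_queue = enqueue(txt_queue, txt_batch, max_size=16)
```

The sketch stops at the loss value. The final recited operation, updating both encoders based on the loss, would in practice be carried out with an automatic-differentiation framework and an optimizer, which are omitted here. Under the assumed loss, the value is zero when the two modalities induce identical similarity distributions over their queues and grows as the distributions diverge.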