US 12,112,523 B2
Systems and methods for vision-language distribution alignment
Shu Zhang, Fremont, CA (US); Junnan Li, Singapore (SG); Ran Xu, Mountain View, CA (US); Caiming Xiong, Menlo Park, CA (US); and Chetan Ramaiah, San Bruno, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Jan. 31, 2022, as Appl. No. 17/589,725.
Claims priority of provisional application 63/281,471, filed on Nov. 19, 2021.
Prior Publication US 2023/0162490 A1, May 25, 2023
Int. Cl. G06V 10/776 (2022.01); G06F 16/56 (2019.01); G06F 16/583 (2019.01); G06F 40/126 (2020.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06V 10/74 (2022.01); G06V 10/80 (2022.01)

CPC G06V 10/776 (2022.01) [G06F 16/56 (2019.01); G06F 16/5846 (2019.01); G06F 40/126 (2020.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06V 10/761 (2022.01); G06V 10/806 (2022.01)]

20 Claims

1. A system for vision-language distribution alignment, the system comprising:

a data interface receiving a first batch of image samples and a second batch of text samples;

a memory storing a plurality of processor-executable instructions, an image encoder for encoding the first batch of image samples into a first plurality of image feature representations stored at a first feature queue, and a text encoder for encoding the second batch of text samples into a second plurality of text feature representations stored at a second feature queue; and

a processor executing the plurality of processor-executable instructions to perform operations comprising:

computing an image-to-image similarity between at least one image feature representation and the first plurality of image feature representations in the first feature queue;

computing a text-to-text similarity between at least one text feature representation and the second plurality of text feature representations in the second feature queue;

computing a cross-modal alignment loss based on the image-to-image similarity and the text-to-text similarity; and

updating the image encoder and the text encoder based at least in part on the cross-modal alignment loss.
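
The operations recited in claim 1 can be illustrated with a minimal sketch. The claim does not state how the two similarities are combined into a loss, so the sketch rests on several assumptions: features are L2-normalized, each similarity is a temperature-scaled softmax over the corresponding feature queue, the two queues hold paired samples at matching positions, and the cross-modal alignment loss is a symmetric KL divergence between the two distributions. All function names, the temperature value, and the queue size are hypothetical. The encoder update step is omitted because it would require an automatic-differentiation framework.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_distribution(query, queue, temperature=0.1):
    """Softmax over cosine similarities between one feature and every queued feature."""
    q = l2_normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, l2_normalize(k))) / temperature
        for k in queue
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_alignment_loss(image_feat, text_feat, image_queue, text_queue,
                               temperature=0.1):
    """Assumed loss: symmetric KL between the image-to-image and
    text-to-text similarity distributions (not specified by the claim)."""
    p_i2i = similarity_distribution(image_feat, image_queue, temperature)
    p_t2t = similarity_distribution(text_feat, text_queue, temperature)
    return 0.5 * (kl_divergence(p_i2i, p_t2t) + kl_divergence(p_t2t, p_i2i))


# Hypothetical toy data: two fixed-size feature queues holding paired samples.
image_queue = deque([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], maxlen=3)
text_queue = deque([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], maxlen=3)

aligned = cross_modal_alignment_loss([1.0, 0.0], [1.0, 0.0], image_queue, text_queue)
misaligned = cross_modal_alignment_loss([1.0, 0.0], [0.0, 1.0], image_queue, text_queue)
```

Under these assumptions, the loss is zero when an image feature and its paired text feature relate to their respective queues in the same way, and it grows as the two similarity distributions diverge. In a training loop, this value would be backpropagated to update both encoders.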