US 12,288,380 B2
Systems and methods for unified vision-language understanding and generation
Junnan Li, Singapore (SG); and Chu Hong Hoi, Singapore (SG)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on May 16, 2022, as Appl. No. 17/745,634.
Claims priority of provisional application 63/301,978, filed on Jan. 21, 2022.
Prior Publication US 2023/0237773 A1, Jul. 27, 2023
Int. Cl. G06V 10/774 (2022.01); G06F 40/126 (2020.01); G06F 40/284 (2020.01); G06T 9/00 (2006.01); G06V 10/764 (2022.01); G06V 10/80 (2022.01)

CPC G06V 10/774 (2022.01) [G06F 40/126 (2020.01); G06F 40/284 (2020.01); G06T 9/00 (2013.01); G06V 10/764 (2022.01); G06V 10/803 (2022.01)]

20 Claims

1. A method of generating enhanced vision-language training data, the method comprising:

receiving, from a communication interface, a first training dataset of image-text pairs and a second training dataset of annotated image-text pairs;

fine-tuning an image-grounded text decoder and an image-grounded text encoder using the second training dataset of annotated image-text pairs;

generating, by the fine-tuned image-grounded text decoder, a predicted text based on a training image from the first training dataset;

generating, by the fine-tuned image-grounded text encoder, a filtering decision based on the training image and the predicted text;

adding the training image and the predicted text to form a third training dataset of image-text pairs depending on the filtering decision; and

training a vision-language model using the third training dataset of image-text pairs.
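
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names toy_fine_tune, captioner, filter_fn, and build_third_dataset are hypothetical, images are represented by plain identifier strings, and the "fine-tuned" decoder and encoder are replaced by trivial stand-ins (a template captioner and a vocabulary check) so that the example runs without any model. The final training step is omitted; the sketch stops once the third dataset is formed.

```python
from typing import Callable, Iterable, List, Set, Tuple

Image = str               # assumption: an image is stood in for by an ID string
Pair = Tuple[Image, str]  # (image, text)


def toy_fine_tune(
    second_dataset: Iterable[Pair],
) -> Tuple[Callable[[Image], str], Callable[[Image, str], bool]]:
    """Hypothetical stand-in for fine-tuning on the annotated pairs.

    Returns a toy 'decoder' (captioner) and a toy 'encoder' (filter).
    Real models would be trained networks; these are not.
    """
    # Collect every word seen in the annotated captions.
    vocab: Set[str] = {w for _, text in second_dataset for w in text.split()}

    def captioner(image: Image) -> str:
        # Toy decoder: produce a templated caption from the image ID.
        return f"a photo of {image}"

    def filter_fn(image: Image, text: str) -> bool:
        # Toy encoder: accept the caption only if all its words are known.
        return all(w in vocab for w in text.split())

    return captioner, filter_fn


def build_third_dataset(
    first_dataset: Iterable[Pair],
    captioner: Callable[[Image], str],
    filter_fn: Callable[[Image, str], bool],
) -> List[Pair]:
    """Caption each image, then keep the pair only if the filter accepts it."""
    third: List[Pair] = []
    for image, _original_text in first_dataset:
        predicted = captioner(image)      # predicted text from the decoder
        if filter_fn(image, predicted):   # filtering decision from the encoder
            third.append((image, predicted))
    return third


# Second dataset: annotated image-text pairs used for "fine-tuning".
second = [("img_a", "a photo of dog"), ("img_b", "a photo of cat")]
# First dataset: image-text pairs whose images are re-captioned.
first = [("dog", "noisy web text"), ("xyz", "buy now")]

captioner, filter_fn = toy_fine_tune(second)
third = build_third_dataset(first, captioner, filter_fn)
print(third)
```

In this toy run the caption for "dog" uses only words seen in the annotated set and is kept, while the caption for "xyz" contains an unseen word and is dropped. A vision-language model would then be trained on the resulting third dataset.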