| CPC G06V 10/774 (2022.01) [G06F 40/126 (2020.01); G06F 40/284 (2020.01); G06T 9/00 (2013.01); G06V 10/764 (2022.01); G06V 10/803 (2022.01)] | 20 Claims |

|
1. A method of generating enhanced vision-language training data, the method comprising:
receiving, from a communication interface, a first training dataset of image-text pairs and a second training dataset of annotated image-text pairs;
fine-tuning an image-grounded text decoder and an image-grounded text encoder using the second training dataset of annotated image-text pairs;
generating, by the fine-tuned image-grounded text decoder, a predicted text based on a training image from the first training dataset;
generating, by the fine-tuned image-grounded text encoder, a filtering decision based on the training image and the predicted text;
adding the training image and the predicted text to form a third training dataset of image-text pairs depending on the filtering decision; and
training a vision-language model using the third training dataset of image-text pairs.
|
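The generate-then-filter steps of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the names `build_third_dataset`, `captioner`, and `keep` are hypothetical, the two callables stand in for the already fine-tuned image-grounded text decoder and encoder, and the fine-tuning step and the final vision-language training step are omitted. Images are represented as plain strings purely for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Placeholder type (assumption): a real pipeline would pass image tensors.
Image = str
Pair = Tuple[Image, str]


def build_third_dataset(
    first_dataset: Iterable[Pair],
    captioner: Callable[[Image], str],
    keep: Callable[[Image, str], bool],
) -> List[Pair]:
    """Form a third dataset from predicted texts that pass the filter.

    captioner: stands in for the fine-tuned image-grounded text decoder.
    keep: stands in for the fine-tuned image-grounded text encoder and
          returns the filtering decision for an (image, text) pair.
    """
    third_dataset: List[Pair] = []
    for image, _original_text in first_dataset:
        # Generate a predicted text from the training image.
        predicted_text = captioner(image)
        # Add the pair only when the filtering decision is positive.
        if keep(image, predicted_text):
            third_dataset.append((image, predicted_text))
    return third_dataset
```

Passing the decoder and encoder in as callables keeps the sketch independent of any particular model library; the returned list is what a downstream training step would consume as the third training dataset.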