| CPC G06N 5/04 (2013.01) [G06N 3/08 (2013.01); G06N 20/00 (2019.01); G06T 2207/00 (2013.01)] | 20 Claims |

|
1. A method of training a machine learning model, the method comprising:
identifying a training set comprising a plurality of images and a plurality of captions corresponding to the images;
encoding the images using an image encoder to produce encoded images;
encoding the captions using a text encoder to produce encoded text;
computing a multi-modal loss function based on the encoded images and the encoded text, the multi-modal loss function comprising at least one image loss term, at least one text loss term, and at least one cross-modal term; and
training the image encoder and the text encoder based on the multi-modal loss function.
|
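As an illustration only, the loss recited in claim 1 can be sketched as a weighted sum of an image term, a text term, and a cross-modal term. The claim does not say how each term is computed, so everything specific below is an assumption: the symmetric InfoNCE contrastive form, the use of two views per modality (`img_a`/`img_b`, `txt_a`/`txt_b`), the temperature, the weights `w_img`, `w_txt`, `w_cross`, and all function names are hypothetical. The encoders themselves are omitted; the inputs stand in for their outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE (assumed form): row i of `a` should match row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row, then pick the
        # diagonal entries, which correspond to the matching pairs.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def multimodal_loss(img_a, img_b, txt_a, txt_b,
                    w_img=1.0, w_txt=1.0, w_cross=1.0):
    """Combine one image term, one text term, and one cross-modal term.

    img_a, img_b: encoded images for two views of the same batch (assumption).
    txt_a, txt_b: encoded captions for two views of the same batch (assumption).
    """
    image_term = info_nce(img_a, img_b)   # image-only loss term
    text_term = info_nce(txt_a, txt_b)    # text-only loss term
    cross_term = info_nce(img_a, txt_a)   # image-to-caption alignment term
    total = w_img * image_term + w_txt * text_term + w_cross * cross_term
    return total, {"image": image_term, "text": text_term, "cross": cross_term}
```

In a training loop, `total` would be the scalar minimized with respect to the parameters of both encoders; an automatic-differentiation framework would replace NumPy for that step.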