| CPC G06N 5/04 (2013.01) [G06N 20/00 (2019.01)] | 25 Claims |

|
1. A computer-implemented method for learning multimodal feature matching comprising:
training an image encoder with a triplet loss that pushes similar images together and dissimilar images apart to obtain encoded images;
training a common classifier on the encoded images by using labeled images to learn text embeddings with corresponding labels; and
training a text encoder while keeping the common classifier in a fixed configuration by using learned text embeddings and corresponding labels for the learned text embeddings, wherein the text encoder is further trained to match a distance of predicted text embeddings which is encoded by the text encoder to a fitted Gaussian distribution on the encoded images.
|
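The quantities named in claim 1 can be illustrated with a minimal NumPy sketch. It is one possible reading of the claim language, not the patented implementation. The claim does not specify the distance metric, the margin, or how the "distance ... to a fitted Gaussian distribution" is computed. The squared Euclidean distance, the `margin` value, the `eps` regularizer, the use of a Mahalanobis distance, and all function names below are assumptions made for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge triplet loss over a batch of embedding rows.

    Assumption: squared Euclidean distance and a fixed margin; the claim
    only says similar images are pushed together and dissimilar ones apart.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def fit_gaussian(encoded_images, eps=1e-6):
    """Fit a single Gaussian (mean, covariance) to encoded image rows.

    Assumption: at least two embedding dimensions; eps is added to the
    diagonal so the covariance stays invertible.
    """
    mean = encoded_images.mean(axis=0)
    dim = encoded_images.shape[1]
    cov = np.cov(encoded_images, rowvar=False) + eps * np.eye(dim)
    return mean, cov


def gaussian_distance_penalty(text_embeddings, mean, cov):
    """Mean squared Mahalanobis distance of text embeddings to the Gaussian.

    Assumption: this is one hypothetical way to measure how far predicted
    text embeddings sit from the distribution fitted on encoded images.
    """
    diff = text_embeddings - mean
    solved = np.linalg.solve(cov, diff.T).T  # cov^{-1} @ diff, row by row
    return float(np.mean(np.sum(diff * solved, axis=1)))
```

In a training loop following the order of the claim, `triplet_loss` would drive the image encoder, `fit_gaussian` would be run once on the resulting encoded images, and `gaussian_distance_penalty` would be added to the text encoder's classification loss while the common classifier stays frozen.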