US 12,223,439 B2
Visual-semantic representation learning via multi-modal contrastive training
Xin Yuan, Chicago, IL (US); Zhe Lin, Bellevue, WA (US); Jason Wen Yong Kuen, Santa Clara, CA (US); Jianming Zhang, Campbell, CA (US); Yilin Wang, Sunnyvale, CA (US); Ajinkya Kale, San Jose, CA (US); and Baldo Faieta, San Francisco, CA (US)
Assigned to ADOBE INC., San Jose, CA (US)
Filed by ADOBE INC., San Jose, CA (US)
Filed on Mar. 3, 2021, as Appl. No. 17/190,668.
Prior Publication US 2022/0284321 A1, Sep. 8, 2022
Int. Cl. G06N 5/04 (2023.01); G06N 3/08 (2023.01); G06N 20/00 (2019.01)
CPC G06N 5/04 (2013.01) [G06N 3/08 (2013.01); G06N 20/00 (2019.01); G06T 2207/00 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method of training a machine learning model, the method comprising:
identifying a training set comprising a plurality of images and a plurality of captions corresponding to the images;
encoding the images using an image encoder to produce encoded images;
encoding the captions using a text encoder to produce encoded text;
computing a multi-modal loss function based on the encoded images and the encoded text, the multi-modal loss function comprising at least one image loss term, at least one text loss term, and at least one cross-modal term; and
training the image encoder and the text encoder based on the multi-modal loss function.
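The loss structure recited in the claim can be sketched numerically. The claim does not specify the form of the three terms; the sketch below assumes, purely for illustration, that each is an InfoNCE-style contrastive term, with the intra-modal image and text terms computed over two augmented views per modality and the cross-modal term computed between paired image and caption embeddings. The names `info_nce`, `multimodal_loss`, and the temperature `tau` are illustrative, not from the patent.

```python
import numpy as np

def info_nce(a, b, tau=0.1):
    """InfoNCE contrastive loss: row i of `a` should match row i of `b`."""
    # L2-normalize embeddings so similarities are cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau  # (n, n) similarity matrix, scaled by temperature
    # Row-wise log-softmax (numerically stable via max subtraction).
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    # Cross-entropy with the matching (diagonal) pair as the positive.
    return -np.mean(np.diag(log_probs))

def multimodal_loss(img_a, img_b, txt_a, txt_b):
    """Sum of an image term, a text term, and a symmetric cross-modal term."""
    image_term = info_nce(img_a, img_b)          # two augmented image views
    text_term = info_nce(txt_a, txt_b)           # two augmented text views
    cross_term = 0.5 * (info_nce(img_a, txt_a) + info_nce(txt_a, img_a))
    return image_term + text_term + cross_term

# Toy batch: 8 image/caption pairs with 16-dimensional embeddings.
rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
txt_a, txt_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss_random = multimodal_loss(img_a, img_b, txt_a, txt_b)

# Perfectly aligned embeddings should score lower than random ones.
loss_aligned = multimodal_loss(img_a, img_a, img_a, img_a)
```

In an actual training loop, `img_a`/`img_b` would come from the claimed image encoder and `txt_a`/`txt_b` from the text encoder, and the gradient of this combined loss would update both encoders jointly, per the final step of the claim.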