US 11,948,078 B2
Joint representation learning from images and text
Arash Vahdat, Santa Clara, CA (US); Tanmay Gupta, Santa Clara, CA (US); Xiaodong Yang, Santa Clara, CA (US); and Jan Kautz, Santa Clara, CA (US)
Assigned to NVIDIA Corporation, Santa Clara, CA (US)
Filed by NVIDIA Corporation, Santa Clara, CA (US)
Filed on Aug. 21, 2020, as Appl. No. 17/000,048.
Claims priority of provisional application 62/891,155, filed on Aug. 23, 2019.
Prior Publication US 2021/0056353 A1, Feb. 25, 2021
Int. Cl. G06N 3/08 (2023.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06V 10/74 (2022.01); G06V 10/82 (2022.01); G06V 30/19 (2022.01); G06V 30/262 (2022.01)

CPC G06N 3/08 (2013.01) [G06F 18/2148 (2023.01); G06F 18/22 (2023.01); G06V 10/761 (2022.01); G06V 10/82 (2022.01); G06V 30/1916 (2022.01); G06V 30/19173 (2022.01); G06V 30/274 (2022.01)]

20 Claims

1. A method of visual representation learning, comprising:

receiving a set of image embeddings from an image representation model and a set of text embeddings from a text representation model; and

training, using a neural network and employing mutual information, a critic function by learning relationships between the set of image embeddings and the set of text embeddings.
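
The training step recited in claim 1 can be illustrated with a small sketch. The following is not the patented implementation; it is one possible reading under stated assumptions. It assumes a bilinear critic f(x, y) = x^T W y (a single linear layer standing in for the neural network), an InfoNCE-style lower bound as the mutual-information objective, and toy one-hot vectors standing in for the image and text embeddings. The names `critic`, `infonce_loss_and_grad`, `train_critic`, and `toy_pairs` are hypothetical and do not come from the patent.

```python
import math
import random


def critic(W, x, y):
    """Assumed bilinear critic: score = x^T W y for one image/text pair."""
    return sum(x[a] * W[a][b] * y[b]
               for a in range(len(x)) for b in range(len(y)))


def infonce_loss_and_grad(W, images, texts):
    """InfoNCE loss over a batch of matched pairs (images[i], texts[i]).

    Every other text in the batch serves as a negative for image i.
    Returns the mean loss and its gradient with respect to W.
    """
    n = len(images)
    dx, dy = len(W), len(W[0])
    grad = [[0.0] * dy for _ in range(dx)]
    loss = 0.0
    for i in range(n):
        scores = [critic(W, images[i], texts[j]) for j in range(n)]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs = [e / z for e in exps]
        loss += -(scores[i] - m - math.log(z))
        for j in range(n):
            # d(loss_i)/d(score_ij) = softmax_j - 1[j == i]
            coeff = probs[j] - (1.0 if j == i else 0.0)
            for a in range(dx):
                for b in range(dy):
                    grad[a][b] += coeff * images[i][a] * texts[j][b]
    return loss / n, [[g / n for g in row] for row in grad]


def train_critic(images, texts, steps=200, lr=0.5, seed=0):
    """Plain gradient descent on W; returns W and the loss history."""
    rng = random.Random(seed)
    dx, dy = len(images[0]), len(texts[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dy)] for _ in range(dx)]
    history = []
    for _ in range(steps):
        loss, grad = infonce_loss_and_grad(W, images, texts)
        history.append(loss)
        for a in range(dx):
            for b in range(dy):
                W[a][b] -= lr * grad[a][b]
    return W, history


def toy_pairs(n=4):
    """Toy stand-ins for embeddings: one-hot images, shifted one-hot texts."""
    images = [[1.0 if k == i else 0.0 for k in range(n)] for i in range(n)]
    texts = [[1.0 if k == (i + 1) % n else 0.0 for k in range(n)]
             for i in range(n)]
    return images, texts


if __name__ == "__main__":
    images, texts = toy_pairs()
    W, history = train_critic(images, texts)
    # InfoNCE gives the lower bound I(X; Y) >= log(N) - loss.
    mi_lower_bound = math.log(len(images)) - history[-1]
    print("initial loss:", history[0])
    print("final loss:", history[-1])
    print("MI lower bound (nats):", mi_lower_bound)
```

In this sketch the critic learns to score matched image/text pairs above mismatched ones, and the quantity log(N) minus the loss serves as a lower bound on the mutual information between the two embedding sets. A practical system would replace the toy vectors with outputs of the image and text representation models and would likely use a deeper critic and an automatic-differentiation framework.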