US 12,462,524 B2
Supervised contrastive learning with multiple positive examples
Dilip Krishnan, Arlington, MA (US); Prannay Khosla, Cambridge, MA (US); Piotr Teterwak, Boston, MA (US); Aaron Yehuda Sarna, Cambridge, MA (US); Aaron Joseph Maschinot, Somerville, MA (US); Ce Liu, Cambridge, MA (US); Philip John Isola, Cambridge, MA (US); Yonglong Tian, Cambridge, MA (US); and Chen Wang, Jersey City, NJ (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/920,623
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Apr. 12, 2021; PCT No. PCT/US2021/026836; § 371(c)(1), (2) Date Oct. 21, 2022; PCT Pub. No. WO2021/216310; PCT Pub. Date Oct. 28, 2021.
Claims priority of provisional application 63/013,153, filed on Apr. 21, 2020.
Prior Publication US 2023/0153629 A1, May 18, 2023
Int. Cl. G06N 3/08 (2023.01); G06F 18/21 (2023.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 18/2431 (2023.01); G06N 3/09 (2023.01); G06V 10/44 (2022.01); G06V 10/74 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)

CPC G06V 10/454 (2022.01) [G06F 18/214 (2023.01); G06F 18/2178 (2023.01); G06F 18/22 (2023.01); G06F 18/2431 (2023.01); G06N 3/08 (2013.01); G06N 3/09 (2023.01); G06V 10/761 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)]

20 Claims

1. A computing system to perform supervised contrastive learning of visual representations, the computing system comprising:

one or more processors; and

one or more non-transitory computer-readable media that collectively store:

a base encoder neural network configured to process an input image to generate an embedding representation of the input image;

a projection head neural network configured to process the embedding representation of the input image to generate a projected representation of the input image; and

instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:

obtaining an anchor image associated with a first class of a plurality of classes, a plurality of positive images associated with the first class, and one or more negative images associated with one or more other classes of the plurality of classes, the one or more other classes being different from the first class, wherein:

the anchor image corresponds to a first image from a training dataset;

the plurality of positive images respectively correspond to a plurality of second images from the training dataset; and

the one or more negative images respectively correspond to one or more third images from the training dataset;

processing, with the base encoder neural network, the anchor image to obtain an anchor embedding representation for the anchor image, the plurality of positive images to respectively obtain a plurality of positive embedding representations, and the one or more negative images to respectively obtain one or more negative embedding representations;

processing, with the projection head neural network, the anchor embedding representation to obtain an anchor projected representation for the anchor image, the plurality of positive embedding representations to respectively obtain a plurality of positive projected representations, and the one or more negative embedding representations to respectively obtain one or more negative projected representations;

evaluating a loss function that evaluates a similarity metric between the anchor projected representation and each of the plurality of positive projected representations and each of the one or more negative projected representations; and

modifying one or more values of one or more parameters of at least the base encoder neural network based at least in part on the loss function.
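
The loss evaluation step recited above can be illustrated with a minimal sketch. The claim does not fix a particular similarity metric or normalization, so the choices here are assumptions made for illustration only: cosine similarity (inner product of L2-normalized vectors), a temperature scaling factor, and a softmax over the positives and negatives, averaged over the plurality of positives. The function names (`l2_normalize`, `dot`, `multi_positive_contrastive_loss`) and the `temperature` parameter are hypothetical and do not appear in the patent. The inputs are assumed to be projected representations already produced by the projection head.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_positive_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Illustrative loss for one anchor with several positives (assumed form).

    Each argument holds projected representations: one anchor vector, a list
    of positive vectors (same class as the anchor), and a list of negative
    vectors (other classes). The anchor is compared against every positive
    and every negative, as the claim requires.
    """
    if not positives:
        raise ValueError("at least one positive representation is required")

    a = l2_normalize(anchor)
    # Similarity of the anchor to each positive and each negative (assumed: cosine / temperature).
    pos_logits = [dot(a, l2_normalize(p)) / temperature for p in positives]
    neg_logits = [dot(a, l2_normalize(n)) / temperature for n in negatives]

    # Numerically stable log-sum-exp over all comparisons.
    all_logits = pos_logits + neg_logits
    m = max(all_logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in all_logits))

    # Average the negative log-probability over the plurality of positives.
    return -sum(z - log_denominator for z in pos_logits) / len(pos_logits)
```

Under these assumptions, the loss is small when the positives lie close to the anchor and the negatives lie far from it, and large in the opposite case. In a training loop, the gradient of this value with respect to the encoder parameters would drive the parameter update described in the final step; that part is omitted because it depends on the framework used.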