| CPC G06V 10/454 (2022.01) [G06F 18/214 (2023.01); G06F 18/2178 (2023.01); G06F 18/22 (2023.01); G06F 18/2431 (2023.01); G06N 3/08 (2013.01); G06N 3/09 (2023.01); G06V 10/761 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)] | 20 Claims |

|
1. A computing system to perform supervised contrastive learning of visual representations, the computing system comprising:
one or more processors; and
one or more non-transitory computer-readable media that collectively store:
a base encoder neural network configured to process an input image to generate an embedding representation of the input image;
a projection head neural network configured to process the embedding representation of the input image to generate a projected representation of the input image; and
instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
obtaining an anchor image associated with a first class of a plurality of classes, a plurality of positive images associated with the first class, and one or more negative images associated with one or more other classes of the plurality of classes, the one or more other classes being different from the first class, wherein:
the anchor image corresponds to a first image from a training dataset;
the plurality of positive images respectively correspond to a plurality of second images from the training dataset; and
the one or more negative images respectively correspond to one or more third images from the training dataset;
processing, with the base encoder neural network, the anchor image to obtain an anchor embedding representation for the anchor image, the plurality of positive images to respectively obtain a plurality of positive embedding representations, and the one or more negative images to respectively obtain one or more negative embedding representations;
processing, with the projection head neural network, the anchor embedding representation to obtain an anchor projected representation for the anchor image, the plurality of positive embedding representations to respectively obtain a plurality of positive projected representations, and the one or more negative embedding representations to respectively obtain one or more negative projected representations;
evaluating a loss function that evaluates a similarity metric between the anchor projected representation and each of the plurality of positive projected representations and each of the one or more negative projected representations; and
modifying one or more values of one or more parameters of at least the base encoder neural network based at least in part on the loss function.
|
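The operations recited in claim 1 can be illustrated with a minimal, non-authoritative sketch. The claim does not fix a particular architecture, similarity metric, or loss formula, so everything concrete below is an assumption made for illustration: the toy linear `encoder` and `projection_head`, the temperature-scaled dot product used as the similarity metric, the softmax-style form of `supcon_loss`, and the finite-difference update in `train_step` (a stand-in for backpropagation). All function and parameter names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encoder(params, image):
    """Toy stand-in for the base encoder: linear map followed by ReLU."""
    return [max(0.0, h) for h in matvec(params["enc"], image)]

def projection_head(params, embedding):
    """Toy stand-in for the projection head: linear map, then L2 normalization."""
    z = matvec(params["proj"], embedding)
    norm = math.sqrt(sum(v * v for v in z)) or 1.0  # avoid division by zero
    return [v / norm for v in z]

def project(params, image):
    """Image -> embedding representation -> projected representation."""
    return projection_head(params, encoder(params, image))

def supcon_loss(z_anchor, z_positives, z_negatives, temperature=0.1):
    """Assumed loss form: mean over positives of the negative log-softmax of
    temperature-scaled dot products between the anchor and all other samples."""
    def sim(a, b):
        return sum(x * y for x, y in zip(a, b)) / temperature
    logits_pos = [sim(z_anchor, z) for z in z_positives]
    logits_all = logits_pos + [sim(z_anchor, z) for z in z_negatives]
    m = max(logits_all)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits_all))
    return -sum(l - log_denom for l in logits_pos) / len(logits_pos)

def batch_loss(params, anchor, positives, negatives):
    """Run anchor, positives, and negatives through both networks, then score."""
    z_a = project(params, anchor)
    z_p = [project(params, x) for x in positives]
    z_n = [project(params, x) for x in negatives]
    return supcon_loss(z_a, z_p, z_n)

def train_step(params, anchor, positives, negatives, lr=0.05, eps=1e-5):
    """Modify the encoder weights based on the loss. Gradients are estimated by
    central finite differences purely to keep the sketch dependency-free."""
    W = params["enc"]
    loss_before = batch_loss(params, anchor, positives, negatives)
    grads = [[0.0] * len(row) for row in W]
    for i in range(len(W)):
        for j in range(len(W[i])):
            old = W[i][j]
            W[i][j] = old + eps
            up = batch_loss(params, anchor, positives, negatives)
            W[i][j] = old - eps
            down = batch_loss(params, anchor, positives, negatives)
            W[i][j] = old  # restore before moving to the next weight
            grads[i][j] = (up - down) / (2 * eps)
    for i in range(len(W)):
        for j in range(len(W[i])):
            W[i][j] -= lr * grads[i][j]
    return loss_before
```

In this sketch only the encoder weights are updated, which matches the claim's minimum requirement of modifying parameters of "at least the base encoder neural network"; a fuller implementation would typically update the projection head as well. Under the assumed loss form, the value is lower when the positive projections lie closer to the anchor than the negative projections do.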