US 12,468,952 B2
Systems and methods for noise-robust contrastive learning
Junnan Li, Singapore (SG); and Chu Hong Hoi, Singapore (SG)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Sep. 9, 2020, as Appl. No. 17/015,858.
Claims priority of provisional application 63/033,547, filed on Jun. 2, 2020.
Prior Publication US 2021/0374553 A1, Dec. 2, 2021
Int. Cl. G06N 3/088 (2023.01); G06N 3/045 (2023.01)
CPC G06N 3/088 (2013.01) [G06N 3/045 (2023.01)] 10 Claims
OG exemplary drawing
 
1. A system for training a neural network model for object identification in images, comprising:
a communication interface configured to receive a training dataset comprising a set of image samples, each image sample having a noisy label belonging to one of a plurality of classes;
a non-transitory memory storing processor-executable instructions and the neural network model, the neural network model including an encoder, a first copy of the encoder, a second copy of the encoder, a classifier, an autoencoder and a first copy of the autoencoder both coupled to the classifier, and a second copy of the autoencoder; and
one or more hardware processors coupled to the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising:
generating, for each image sample of the set of image samples, a first augmented image sample by modifying a first amount of the respective image sample, a second augmented image sample by modifying a second, different amount of the same respective image sample, and an interpolated image sample by taking a linear interpolation, the first augmented image sample, the second augmented image sample, and the interpolated image sample corresponding to a same original image sample;
encoding the first augmented image sample by the encoder of the neural network model, the second augmented image sample by the first copy of the encoder, and the interpolated image sample by the second copy of the encoder operated in parallel to the encoder, into a first high-dimensional feature representation, a second high-dimensional feature representation, and a third high-dimensional feature representation, respectively;
projecting, respectively by the autoencoder, the first copy of the autoencoder, and the second copy of the autoencoder operated in parallel, the first high-dimensional feature representation, the second high-dimensional feature representation, and the third high-dimensional feature representation to a first embedding, a second embedding, and a third embedding that are normalized in a low-dimensional embedding space;
reconstructing, respectively by the autoencoder, the first copy of the autoencoder and the second copy of the autoencoder, the first high-dimensional feature representation based on the first embedding, the second high-dimensional feature representation based on the second embedding, and the third high-dimensional feature representation based on the third embedding;
generating, by the classifier, classification probabilities based on the first high-dimensional feature representation and the second high-dimensional feature representation;
computing a cross-entropy loss based on the classification probabilities;
computing a consistency contrastive loss based on a positive pair of the first embedding and the second embedding and one or more negative pairs of embeddings projected from augmented image samples that do not correspond to a same original image sample;
computing, for each class of the plurality of classes, a respective class prototype as a normalized mean embedding over image samples that belong to the respective class in the training dataset;
computing a prototypical contrastive loss based on a weighted combination of a comparison between the first embedding and the third embedding and a comparison between the second embedding and the third embedding;
computing a reconstruction loss based on a first reconstruction loss and a second reconstruction loss, wherein the first reconstruction loss is calculated based on the first embedding and the reconstructed first high-dimensional feature representation, and the second reconstruction loss is calculated based on the second embedding and the reconstructed second high-dimensional feature representation;
computing a combined loss based on a weighted sum of the cross-entropy loss, the consistency contrastive loss, the prototypical contrastive loss, and the reconstruction loss;
training the neural network model by minimizing the combined loss; and
predicting, by the trained neural network model, a class label for object identification in an input image.
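The augmentation and interpolation steps recited in the claim can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation: the noise-based `augment` function and the mixup-style `interpolate` function are hypothetical stand-ins for whatever "modifying an amount of the image sample" and "taking a linear interpolation" denote in the specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, amount):
    """Hypothetical augmentation: perturb the image by `amount` of noise,
    standing in for the claim's 'modifying an amount of the image sample'."""
    return image + amount * rng.standard_normal(image.shape)

def interpolate(image_a, image_b, lam):
    """Linear interpolation of two image samples (mixup-style)."""
    return lam * image_a + (1.0 - lam) * image_b

image = rng.standard_normal((32, 32, 3))
other = rng.standard_normal((32, 32, 3))

first_aug = augment(image, 0.1)              # first augmented image sample
second_aug = augment(image, 0.5)             # second, differently modified sample
interp = interpolate(image, other, lam=0.7)  # interpolated image sample
```

The three resulting samples feed the encoder, its first copy, and its second copy, respectively, as the claim recites.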
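The consistency contrastive loss over a positive pair of embeddings with cross-sample negatives is commonly realized as an InfoNCE-style objective; a minimal sketch, assuming that formulation (the temperature parameter and cosine-similarity logits are assumptions, not claim language):

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def consistency_contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: (z1[i], z2[i]) is the positive pair for sample i;
    embeddings of all other original images serve as negatives."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / temperature                 # (N, N) cosine similarities
    m = logits.max(axis=1, keepdims=True)            # for numerical stability
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # -log p(positive | row)

rng = np.random.default_rng(1)
z1 = rng.standard_normal((8, 16))                    # embeddings of first views
z2 = z1 + 0.05 * rng.standard_normal((8, 16))        # embeddings of second views
loss = consistency_contrastive_loss(z1, z2)
```

Because the per-row log-softmax is never positive, the loss is non-negative and is minimized when each embedding is most similar to its own second view.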
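The class-prototype computation is specified directly in the claim (normalized mean embedding per class); the prototypical contrastive loss below is one common realization that scores embeddings against those prototypes. The claim's weighted combination of first-vs-third and second-vs-third comparisons is not reproduced here; this sketch only illustrates the prototype machinery.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def class_prototypes(embeddings, labels, num_classes):
    """Normalized mean embedding per class, per the claim's 'class prototype'."""
    means = np.stack([embeddings[labels == c].mean(axis=0)
                      for c in range(num_classes)])
    return l2_normalize(means)

def prototypical_contrastive_loss(z, labels, prototypes, temperature=0.1):
    """Cross-entropy of embeddings against class prototypes (one common form
    of a prototypical contrastive loss; the temperature is an assumption)."""
    logits = l2_normalize(z) @ prototypes.T / temperature
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(labels)), labels])

rng = np.random.default_rng(2)
z = rng.standard_normal((12, 16))
labels = np.array([0, 1, 2] * 4)
protos = class_prototypes(z, labels, num_classes=3)
loss = prototypical_contrastive_loss(z, labels, protos)
```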
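The autoencoder's projection, reconstruction, and the combined training objective can be sketched as below. The linear projection/reconstruction weights and the mean-squared-error form of the reconstruction loss are assumptions for illustration; the claim recites only that embeddings are normalized, features are reconstructed from them, and the four losses are combined as a weighted sum.

```python
import numpy as np

rng = np.random.default_rng(3)
D_HI, D_LO = 128, 16
W_proj = 0.1 * rng.standard_normal((D_HI, D_LO))   # hypothetical encoder weights
W_recon = 0.1 * rng.standard_normal((D_LO, D_HI))  # hypothetical decoder weights

def project(feature):
    """Autoencoder bottleneck: high-dim feature -> normalized low-dim embedding."""
    z = feature @ W_proj
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def reconstruct(embedding):
    """Decode the low-dimensional embedding back to feature space."""
    return embedding @ W_recon

def reconstruction_loss(feature, embedding):
    """Mean-squared error between the feature and its reconstruction."""
    return np.mean((reconstruct(embedding) - feature) ** 2)

def combined_loss(ce, cc, pc, rec, w_cc=1.0, w_pc=1.0, w_rec=1.0):
    """Weighted sum of the four losses recited in the claim."""
    return ce + w_cc * cc + w_pc * pc + w_rec * rec

feat = rng.standard_normal((4, D_HI))
z = project(feat)
rec = reconstruction_loss(feat, z)
total = combined_loss(ce=0.9, cc=1.2, pc=0.8, rec=rec)
```

Training then minimizes `total` with respect to the encoder, autoencoder, and classifier parameters, as in the claim's final training and prediction steps.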