CPC G06N 3/088 (2013.01); G06N 3/045 (2023.01). 10 Claims.

1. A system for training a neural network model for object identification in images, comprising:
a communication interface configured to receive a training dataset comprising a set of image samples, each image sample having a noisy label that belongs to one of a plurality of classes;
a non-transitory memory storing processor-executable instructions and the neural network model, the neural network model including an encoder, a first copy of the encoder, a second copy of the encoder, a classifier, an autoencoder and a first copy of the autoencoder both coupled to the classifier, and a second copy of the autoencoder; and
one or more hardware processors coupled to the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising:
generating, for each image sample of the set of image samples, a first augmented image sample by modifying a first amount of the respective image sample, a second augmented image sample by modifying a second, different amount of the same respective image sample, and an interpolated image sample by taking a linear interpolation of the first augmented image sample and the second augmented image sample, such that the first augmented image sample, the second augmented image sample and the interpolated image sample all correspond to a same original image sample;
encoding the first augmented image sample by the encoder of the neural network model, the second augmented image sample by the first copy of the encoder, and the interpolated image sample by the second copy of the encoder operated in parallel with the encoder, into a first high-dimensional feature representation, a second high-dimensional feature representation, and a third high-dimensional feature representation, respectively;
projecting, respectively by the autoencoder, the first copy of the autoencoder and the second copy of the autoencoder operated in parallel, the first high-dimensional feature representation, the second high-dimensional feature representation, and the third high-dimensional feature representation to a first embedding, a second embedding, and a third embedding that are normalized in a low-dimensional embedding space;
reconstructing, respectively by the autoencoder, the first copy of the autoencoder and the second copy of the autoencoder, the first high-dimensional feature representation based on the first embedding, the second high-dimensional feature representation based on the second embedding, and the third high-dimensional feature representation based on the third embedding;
generating, by the classifier, classification probabilities based on the first high-dimensional feature representation and the second high-dimensional feature representation;
computing a cross-entropy loss based on the classification probabilities;
computing a consistency contrastive loss based on a positive pair of the first embedding and the second embedding and one or more negative pairs of embeddings projected from augmented image samples that do not correspond to a same original image sample;
computing, for each class of the plurality of classes, a respective class prototype as a normalized mean embedding over image samples that belong to the respective class in the training dataset;
computing a prototypical contrastive loss based on a weighted combination of a comparison between the first embedding and the third embedding and a comparison between the second embedding and the third embedding;
computing a reconstruction loss based on a first reconstruction loss and a second reconstruction loss, wherein the first reconstruction loss is calculated based on the first embedding and the reconstructed first high-dimensional feature representation, and the second reconstruction loss is calculated based on the second embedding and the reconstructed second high-dimensional feature representation;
computing a combined loss based on a weighted sum of the cross-entropy loss, the consistency contrastive loss, the prototypical contrastive loss and the reconstruction loss;
training the neural network model by minimizing the combined loss; and
predicting, by the trained neural network model, a class label for object identification in an input image.
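For illustration only, the following is a minimal sketch of the model structure and forward pass recited in claim 1 (the encoder and its two parallel copies, the autoencoder and its two parallel copies, and the classifier), assuming PyTorch; the module names, the linear layers, and the mixup-style interpolation coefficient lam are hypothetical choices and not drawn from the claim.

```python
# Minimal sketch only, not the claimed implementation. Assumes PyTorch;
# all module names and hyperparameters are hypothetical.
import copy
import torch.nn.functional as F
from torch import nn


class ProjectionAutoEncoder(nn.Module):
    """Projects a high-dimensional feature to a normalized low-dimensional
    embedding and reconstructs the feature from that embedding
    (simple linear encoder/decoder, an assumption)."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.project = nn.Linear(feat_dim, embed_dim)
        self.reconstruct = nn.Linear(embed_dim, feat_dim)

    def forward(self, h):
        z = F.normalize(self.project(h), dim=-1)   # normalized embedding
        h_rec = self.reconstruct(z)                # reconstructed feature
        return z, h_rec


class NoisyLabelModel(nn.Module):
    """Encoder plus two parallel copies, autoencoder plus two parallel copies,
    and a classifier over the high-dimensional features."""

    def __init__(self, encoder: nn.Module, feat_dim: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.encoder_copy1 = copy.deepcopy(encoder)
        self.encoder_copy2 = copy.deepcopy(encoder)
        self.autoencoder = ProjectionAutoEncoder(feat_dim, embed_dim)
        self.autoencoder_copy1 = copy.deepcopy(self.autoencoder)
        self.autoencoder_copy2 = copy.deepcopy(self.autoencoder)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x_aug1, x_aug2, lam: float = 0.5):
        # Interpolated sample: linear interpolation of the two augmented views
        # of the same original image (mixup-style coefficient lam is assumed).
        x_interp = lam * x_aug1 + (1.0 - lam) * x_aug2

        # Three parallel encoders produce the high-dimensional feature representations.
        h1 = self.encoder(x_aug1)
        h2 = self.encoder_copy1(x_aug2)
        h3 = self.encoder_copy2(x_interp)

        # Project to normalized low-dimensional embeddings and reconstruct the features.
        z1, h1_rec = self.autoencoder(h1)
        z2, h2_rec = self.autoencoder_copy1(h2)
        z3, _ = self.autoencoder_copy2(h3)

        # Classification logits from the two augmented views' features.
        logits1 = self.classifier(h1)
        logits2 = self.classifier(h2)
        return (z1, z2, z3), (h1, h2, h1_rec, h2_rec), (logits1, logits2)
```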
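The loss terms recited in the claim can be sketched as follows. The InfoNCE-style form of the consistency contrastive loss, the temperature tau, the loss weights, and the use of a mean-squared reconstruction error between each feature representation and its reconstruction are assumptions made for illustration, not limitations drawn from the claim.

```python
# Loss sketch only; the formulations, temperature tau and the weights
# w_cc, w_pc, w_rec are assumptions, not taken from the claim.
import torch
import torch.nn.functional as F


def consistency_contrastive_loss(z1, z2, tau: float = 0.1):
    """InfoNCE-style loss: (z1[i], z2[i]) is the positive pair; embeddings from
    other original images in the batch form the negative pairs."""
    logits = z1 @ z2.t() / tau                                # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


def class_prototypes(embeddings, noisy_labels, num_classes: int):
    """Class prototype per claim step: normalized mean embedding over the
    (noisily labeled) samples of each class."""
    protos = torch.zeros(num_classes, embeddings.size(1), device=embeddings.device)
    protos.index_add_(0, noisy_labels, embeddings)            # per-class sums
    return F.normalize(protos, dim=-1)                        # same direction as the mean


def prototypical_contrastive_loss(z1, z2, z3, lam: float = 0.5, tau: float = 0.1):
    """Weighted combination of a comparison between the first embedding and the
    interpolated-view embedding and a comparison between the second embedding
    and the interpolated-view embedding (weights mirror the interpolation
    coefficient, an assumption)."""
    sim13 = (z1 * z3).sum(dim=-1) / tau
    sim23 = (z2 * z3).sum(dim=-1) / tau
    return -(lam * sim13 + (1.0 - lam) * sim23).mean()


def combined_loss(logits1, logits2, noisy_labels, z1, z2, z3,
                  h1, h2, h1_rec, h2_rec,
                  w_cc: float = 1.0, w_pc: float = 1.0, w_rec: float = 1.0,
                  lam: float = 0.5):
    # Cross-entropy over the classification probabilities of both augmented views.
    ce = F.cross_entropy(logits1, noisy_labels) + F.cross_entropy(logits2, noisy_labels)
    cc = consistency_contrastive_loss(z1, z2)
    pc = prototypical_contrastive_loss(z1, z2, z3, lam)
    # Reconstruction terms between each feature representation and its
    # reconstruction (a standard autoencoder objective, assumed here).
    rec = F.mse_loss(h1_rec, h1) + F.mse_loss(h2_rec, h2)
    # Weighted sum of all terms, as in the claim's combined loss.
    return ce + w_cc * cc + w_pc * pc + w_rec * rec
```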
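A corresponding training step and prediction call might then look like the following, assuming the sketches above; the stand-in backbone, dummy data, optimizer settings, and batch shapes are placeholders.

```python
# Hypothetical usage with dummy data; backbone, shapes and hyperparameters are placeholders.
import torch
from torch import nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # stand-in encoder
model = NoisyLabelModel(encoder=backbone, feat_dim=512, embed_dim=128, num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

x_aug1 = torch.randn(8, 3, 32, 32)          # first augmented views (dummy batch)
x_aug2 = torch.randn(8, 3, 32, 32)          # second augmented views of the same images
noisy_labels = torch.randint(0, 10, (8,))   # noisy labels for the batch

(z1, z2, z3), (h1, h2, h1_rec, h2_rec), (logits1, logits2) = model(x_aug1, x_aug2, lam=0.7)
loss = combined_loss(logits1, logits2, noisy_labels, z1, z2, z3,
                     h1, h2, h1_rec, h2_rec, lam=0.7)

optimizer.zero_grad()
loss.backward()       # train the model by minimizing the combined loss
optimizer.step()

# After training: predict a class label for object identification in an input image.
with torch.no_grad():
    input_image = torch.randn(1, 3, 32, 32)
    predicted_class = model.classifier(model.encoder(input_image)).argmax(dim=-1)
```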