US 12,254,413 B2
Systems and methods for contrastive learning of visual representations
Ting Chen, Toronto (CA); Simon Kornblith, Toronto (CA); Mohammad Norouzi, Toronto (CA); Geoffrey Everest Hinton, Toronto (CA); and Kevin Jordan Swersky, Mississauga (CA)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jun. 28, 2023, as Appl. No. 18/343,579.
Application 18/343,579 is a continuation of application No. 17/863,070, filed on Jul. 12, 2022, granted, now 11,847,571.
Application 17/863,070 is a continuation of application No. 17/018,372, filed on Sep. 11, 2020, granted, now 11,386,302, issued on Jul. 12, 2022.
Application 17/018,372 is a continuation in part of application No. 16/847,163, filed on Apr. 13, 2020, granted, now 11,354,778, issued on Jun. 7, 2022.
Prior Publication US 2023/0342616 A1, Oct. 26, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 10/20 (2022.01); G06F 18/21 (2023.01); G06F 18/214 (2023.01); G06F 18/241 (2023.01); G06N 3/08 (2023.01); G06N 3/084 (2023.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/778 (2022.01)

CPC G06N 3/084 (2013.01) [G06F 18/2155 (2023.01); G06F 18/2178 (2023.01); G06F 18/241 (2023.01); G06N 3/08 (2013.01); G06V 10/764 (2022.01); G06V 10/7753 (2022.01); G06V 10/7788 (2022.01); G06T 2207/20081 (2013.01)]

20 Claims

1. A computing system to perform contrastive learning, the computing system comprising:

one or more processors; and

one or more non-transitory computer-readable media that collectively store:

a base encoder neural network configured to process an input to generate an intermediate representation of the input;

a projection head neural network configured to process the intermediate representation of the input to generate a projected representation of the input; and

instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:

obtaining one or more training inputs;

performing one or more first augmentation operations on at least one of the training inputs to obtain a first augmented input;

separate from performing the one or more first augmentation operations, performing one or more second augmentation operations on the at least one of the training inputs to obtain a second augmented input;

wherein at least one of the one or more first augmentation operations or the one or more second augmentation operations comprise one or both of a random crop operation that randomly crops the training input and a random color distortion operation that randomly modifies color values of the training input;

respectively processing, with the base encoder neural network, the first augmented input and the second augmented input to respectively generate a first intermediate representation for the first augmented input and a second intermediate representation for the second augmented input;

respectively processing, with the projection head neural network, the first intermediate representation and the second intermediate representation to respectively obtain a first projected representation for the first augmented input and a second projected representation for the second augmented input;

evaluating a loss function that evaluates a difference between the first projected representation and the second projected representation; and

modifying one or more values of one or more parameters of one or both of the base encoder neural network and the projection head neural network based at least in part on the loss function.
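
The operations recited in claim 1 can be illustrated with a minimal sketch in plain Python. It is not the patented implementation, and every name in it (random_crop, random_color_distortion, base_encoder, projection_head, cosine_difference_loss, training_step) is hypothetical. It assumes a single-channel image stored as a list of rows, a one-layer ReLU encoder, and a linear projection head. The claim does not fix the form of the loss, so one minus cosine similarity is used here as a simple stand-in for "a difference between" the two projected representations. A finite-difference update of the projection head stands in for backpropagation.

```python
import math
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a randomly positioned crop_h x crop_w window of a 2-D image."""
    top = rng.randrange(len(image) - crop_h + 1)
    left = rng.randrange(len(image[0]) - crop_w + 1)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_color_distortion(image, rng, strength=0.5):
    """Randomly scale and shift pixel values, clamping the result to [0, 1]."""
    scale = 1.0 + rng.uniform(-strength, strength)
    shift = 0.2 * rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, p * scale + shift)) for p in row] for row in image]


def linear(weights, vector):
    """Matrix-vector product with the matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def base_encoder(params, image):
    """Assumed toy encoder: flatten, one linear layer, ReLU."""
    flat = [p for row in image for p in row]
    return [max(0.0, v) for v in linear(params["encoder"], flat)]


def projection_head(params, intermediate):
    """Assumed toy projection head: a single linear layer."""
    return linear(params["head"], intermediate)


def cosine_difference_loss(z1, z2):
    """1 - cosine similarity; near 0 when the two projections are aligned."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    return 1.0 - dot / (norm1 * norm2 + 1e-12)


def init_params(rng, in_dim=16, hidden=8, out=4):
    """Small random weight matrices for the encoder and the head."""
    return {
        "encoder": [[rng.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(hidden)],
        "head": [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(out)],
    }


def training_step(params, image, rng, lr=0.1, eps=1e-4):
    """Two separately augmented views -> encoder -> head -> loss -> update."""
    view1 = random_color_distortion(random_crop(image, 4, 4, rng), rng)
    view2 = random_color_distortion(random_crop(image, 4, 4, rng), rng)
    h1, h2 = base_encoder(params, view1), base_encoder(params, view2)

    def loss_fn():
        return cosine_difference_loss(projection_head(params, h1),
                                      projection_head(params, h2))

    loss = loss_fn()
    # Finite-difference gradient of the loss w.r.t. the head weights only.
    grads = []
    for row in params["head"]:
        grad_row = []
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            grad_row.append((loss_fn() - loss) / eps)
            row[j] = original
        grads.append(grad_row)
    # Gradient-descent update: modify parameter values based on the loss.
    for row, grad_row in zip(params["head"], grads):
        for j, g in enumerate(grad_row):
            row[j] -= lr * g
    return loss


rng = random.Random(0)
image = [[rng.random() for _ in range(6)] for _ in range(6)]
params = init_params(rng)
losses = [training_step(params, image, rng) for _ in range(5)]
```

Each call to training_step draws fresh augmentations, so the recorded losses are not expected to decrease monotonically. The sketch updates only the projection head, which is consistent with the claim language "one or both of the base encoder neural network and the projection head neural network".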