US 12,080,055 B2
	Multi-task self-training for learning general representations
Tsung-Yi Lin, Sunnyvale, CA (US); Barret Zoph, Sunnyvale, CA (US); Ekin Dogus Cubuk, Sunnyvale, CA (US); Golnaz Ghiasi, Mountain View, CA (US); and Quoc V. Le, Sunnyvale, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 17, 2022, as Appl. No. 17/697,750.
Claims priority of provisional application 63/162,467, filed on Mar. 17, 2021.
Prior Publication US 2022/0301298 A1, Sep. 22, 2022
Int. Cl. G06V 10/82 (2022.01); G06N 3/084 (2023.01); G06V 10/764 (2022.01); G06V 10/77 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/80 (2022.01)

CPC G06V 10/82 (2022.01) [G06N 3/084 (2013.01); G06V 10/764 (2022.01); G06V 10/7715 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/806 (2022.01)]

20 Claims

1. A method performed by one or more computers for training an image representation neural network having a plurality of image representation parameters, wherein the image representation neural network is configured to receive as input an image and to process the image in accordance with the image representation parameters to generate as output a set of feature maps characterizing the image, the method comprising repeatedly performing operations comprising:

obtaining training data comprising a plurality of training images and, for each of the training images, a respective label for each of a plurality of different computer vision tasks, wherein:

for one or more of the plurality of training images, the respective label for at least one of the computer vision tasks is a pseudo-label that is generated by processing the training image using a teacher neural network that corresponds to the computer vision task and that has already been trained to perform the computer vision task;

processing the training images using the image representation neural network to generate a respective set of feature maps for each of the training images;

for each training image and for each of the plurality of different computer vision tasks:

generating a feature representation for the training image and for the computer vision task from one or more of the feature maps for the image representation neural network; and

processing the feature representation using an output neural network head corresponding to the computer vision task to generate a predicted output for the computer vision task; and

updating the image representation parameters by computing a gradient with respect to the image representation parameters of an overall loss function that includes a respective task-specific loss function for each of the plurality of computer vision tasks that measures, for each training image, an error between (i) the predicted output for the computer vision task for the training image and (ii) the respective label for the computer vision task for the training image.
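
The operations recited in claim 1 can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the patented implementation. Each "image" is a short list of floats, the image representation neural network is a single linear layer `W`, each teacher and each output head is a linear map, every task is treated as scalar regression with a squared-error task loss, and every label is a pseudo-label. The names `backbone`, `make_pseudo_labels`, `overall_loss`, `train_step`, `task_a`, and `task_b` are hypothetical and do not appear in the patent. The sketch also updates the head parameters alongside the image representation parameters, which the claim does not require.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def backbone(W, image):
    # Stand-in for the image representation network: one linear layer
    # mapping an input vector to a feature vector (the "feature maps").
    return [dot(row, image) for row in W]

def make_pseudo_labels(teachers, images):
    # Each frozen, already-trained teacher labels every image for its own task.
    return [{task: dot(w, img) for task, w in teachers.items()} for img in images]

def overall_loss(W, heads, images, labels):
    # Sum of per-task squared errors, averaged over the training images.
    total = 0.0
    for img, lab in zip(images, labels):
        feats = backbone(W, img)
        for task, h in heads.items():
            total += 0.5 * (dot(h, feats) - lab[task]) ** 2
    return total / len(images)

def train_step(W, heads, images, labels, lr=0.05):
    # One gradient-descent step on the overall multi-task loss.
    n = len(images)
    grad_W = [[0.0] * len(W[0]) for _ in W]
    grad_heads = {t: [0.0] * len(h) for t, h in heads.items()}
    for img, lab in zip(images, labels):
        feats = backbone(W, img)
        for task, h in heads.items():
            err = dot(h, feats) - lab[task]  # predicted output minus label
            for k in range(len(h)):
                grad_heads[task][k] += err * feats[k] / n
                for j in range(len(img)):
                    # Gradient w.r.t. the shared representation parameters;
                    # contributions from every task accumulate here.
                    grad_W[k][j] += err * h[k] * img[j] / n
    for k, row in enumerate(W):
        for j in range(len(row)):
            row[j] -= lr * grad_W[k][j]
    for task, h in heads.items():
        for k in range(len(h)):
            h[k] -= lr * grad_heads[task][k]

rng = random.Random(0)
images = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
# Hypothetical frozen teachers, one per task.
teachers = {"task_a": [1.0, -2.0, 0.5, 0.0], "task_b": [0.0, 1.0, 1.0, -1.0]}
labels = make_pseudo_labels(teachers, images)

W = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)]
heads = {t: [rng.uniform(-0.5, 0.5) for _ in range(3)] for t in teachers}

loss_before = overall_loss(W, heads, images, labels)
for _ in range(200):  # "repeatedly performing operations"
    train_step(W, heads, images, labels)
loss_after = overall_loss(W, heads, images, labels)
print(loss_before, loss_after)
```

The point of the sketch is the data flow: one shared set of representation parameters receives gradient contributions from every task-specific loss, while the teachers stay fixed and only supply labels.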