US 12,272,442 B2
Transfer learning between different computer vision tasks
Xiaohua Zhai, Zurich (CH); Sylvain Gelly, Zurich (CH); Alexander Kolesnikov, Zurich (CH); Yin Ching Jessica Yung, Vienna (AT); Joan Puigcerver i Perez, Zurich (CH); Lucas Klaus Beyer, Zurich (CH); Neil Matthew Tinmouth Houlsby, Zurich (CH); Wen Yau Aaron Loh, Mountain View, CA (US); Alan Prasana Karthikesalingam, London (GB); Basil Mustafa, Zurich (CH); Jan Freyberg, London (GB); Patricia Leigh MacWilliams, London (GB); and Vivek Natarajan, Sunnyvale, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 14, 2021, as Appl. No. 17/551,050.
Claims priority of provisional application 63/125,353, filed on Dec. 14, 2020.
Prior Publication US 2022/0189612 A1, Jun. 16, 2022
Int. Cl. G16H 30/40 (2018.01); G06N 3/02 (2006.01); G06N 3/096 (2023.01); G06T 3/4053 (2024.01); G06T 7/00 (2017.01); G06V 10/82 (2022.01); G16H 50/20 (2018.01)

CPC G16H 30/40 (2018.01) [G06N 3/02 (2013.01); G06N 3/096 (2023.01); G06T 3/4053 (2013.01); G06T 7/0012 (2013.01); G06V 10/82 (2022.01); G16H 50/20 (2018.01); G06T 2207/20081 (2013.01)]

20 Claims

1. A method performed by one or more computers for training a neural network comprising a plurality of neural network layers to perform a downstream computer vision task, each of the plurality of neural network layers having a respective plurality of layer parameters, and the neural network being configured to receive a network input comprising one or more images and to process the network input to generate a network output for the downstream computer vision task, the method comprising:

pre-training an initial neural network comprising a first subset of the plurality of neural network layers on first training data for an initial computer vision task through supervised learning to determine first values of the respective layer parameters of the first subset of neural network layers, wherein the first subset of the plurality of neural network layers comprises:

a first convolutional neural network layer with weight standardization (WS) followed by a group normalization (GN) layer, wherein the initial neural network comprises the first subset of neural network layers and one or more first additional neural network layers that (i) receive an output generated by the first subset of neural network layers by processing the network input and (ii) process the output generated by the first subset of neural network layers to generate an output for the initial computer vision task, and wherein the GN layer performs group normalization comprising:

receiving as input a feature map comprising a plurality of channels, each comprising a plurality of values;

dividing the plurality of channels into a plurality of groups, wherein for each group, the GN layer (i) computes a mean and a standard deviation of the respective values of the channels within the group and (ii) normalizes each value of the respective channels within the group using the mean and the standard deviation computed for the group; and

generating an output feature map by applying a learned per-channel linear transformation to the normalized values; and

after pre-training the initial neural network comprising the first subset of the plurality of neural network layers on the first training data for the initial computer vision task, training the neural network on second training data for the downstream computer vision task through supervised learning to determine trained values of the respective layer parameters of the first subset of neural network layers from the first values of the respective layer parameters of the first subset of neural network layers, wherein the neural network comprises:

the first subset of neural network layers of the initial neural network comprising the first convolutional neural network layer with WS followed by the GN layer; and

one or more second additional neural network layers that (i) receive the output generated by the first subset of neural network layers by processing the network input and (ii) process the output generated by the first subset of neural network layers to generate the network output for the downstream computer vision task, and

wherein the second training data for the downstream computer vision task contains fewer training examples than the first training data for the initial computer vision task.
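
For illustration only, the group normalization steps recited in claim 1 (receive a multi-channel feature map, divide the channels into groups, normalize each value by its group's mean and standard deviation, then apply a learned per-channel linear transformation) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `group_norm` and `standardize_weights`, the flat-list representation of each channel, the `eps` stabilizer, and the even split of channels into groups are all assumptions that the claim does not specify. The claim also does not define weight standardization; the helper below follows the commonly used formulation in which each convolutional filter's weights are shifted to zero mean and scaled to unit variance.

```python
import math


def group_norm(channels, num_groups, gamma, beta, eps=1e-5):
    """Group-normalize a feature map given as a list of channels.

    channels: list of channels, each a flat list of float values.
    gamma, beta: per-channel scale and shift (the learned linear transform).
    eps: small constant for numerical stability (an assumption, not in the claim).
    """
    num_channels = len(channels)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must evenly divide the number of channels")
    per_group = num_channels // num_groups

    output = []
    for g in range(num_groups):
        group = channels[g * per_group:(g + 1) * per_group]
        # (i) Mean and standard deviation over all values of the group's channels.
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(variance + eps)
        # (ii) Normalize each value, then apply the per-channel linear transform.
        for offset, channel in enumerate(group):
            c = g * per_group + offset
            output.append([gamma[c] * (v - mean) / std + beta[c] for v in channel])
    return output


def standardize_weights(filters, eps=1e-5):
    """Standardize each filter (a flat list of weights) to zero mean, unit variance.

    Assumed formulation of weight standardization; the claim names WS but
    does not spell out its computation.
    """
    standardized = []
    for weights in filters:
        mean = sum(weights) / len(weights)
        variance = sum((w - mean) ** 2 for w in weights) / len(weights)
        std = math.sqrt(variance + eps)
        standardized.append([(w - mean) / std for w in weights])
    return standardized


# Hypothetical 4-channel feature map with 2 values per channel, split into 2 groups.
feature_map = [[1.0, 3.0], [5.0, 7.0], [2.0, 2.0], [4.0, 4.0]]
normalized = group_norm(feature_map, num_groups=2,
                        gamma=[1.0] * 4, beta=[0.0] * 4)
```

Because the statistics are computed per group within a single feature map rather than across a batch, the normalization in this sketch does not depend on how many examples are processed together.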