US 12,254,411 B2
Attention neural networks with linear units
Noam M. Shazeer, Palo Alto, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Feb. 12, 2021, as Appl. No. 17/175,567.
Claims priority of provisional application 62/975,707, filed on Feb. 12, 2020.
Prior Publication US 2021/0248473 A1, Aug. 12, 2021
Int. Cl. G06N 3/08 (2023.01); G06N 3/048 (2023.01); G06N 3/082 (2023.01)
CPC G06N 3/082 (2013.01) [G06N 3/048 (2023.01)] 19 Claims
OG exemplary drawing
 
1. A system for performing a machine learning task on a network input to generate a network output, the system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to implement:
an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of attention layers, each attention layer comprising an attention sub-layer and a feed-forward sub-layer, the attention sub-layer configured to:
receive an input sequence for the attention layer comprising a respective layer input at each of one or more positions; and
generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the attention layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
receive the attended input sequence generated by the attention sub-layer of the attention layer; and
generate an output sequence for the attention layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, and the generating comprising, for each of the positions:
generating a first transformed input from the attended layer input at the position in the attended input sequence generated by the attention sub-layer of the attention layer, comprising applying a first linear transformation to the attended layer input at the position;
generating a second transformed input from the attended layer input at the position in the attended input sequence generated by the attention sub-layer of the attention layer, comprising applying a second, different linear transformation to the attended layer input at the position, wherein:
the first and second linear transformations have been learned during the training of the attention neural network to perform the machine learning task,
the training of the attention neural network comprises training the attention neural network on an unsupervised data set through unsupervised learning,
the same first linear transformation is applied to the attended layer inputs at each of the positions in the attended input sequence, and
the same second linear transformation is applied to the attended layer inputs at each of the positions in the attended input sequence;
generating a third transformed input by performing an element-wise multiplication between (i) the first transformed input generated from the attended layer input at the position in the attended input sequence generated by the attention sub-layer of the attention layer and (ii) the second transformed input generated from the attended layer input at the position in the attended input sequence generated by the attention sub-layer of the attention layer; and
generating the layer output at the position from the third transformed input;
wherein the attention neural network further comprises one or more output layers that are configured to process at least one of the layer outputs in an output sequence generated by one of the attention layers to generate at least a portion of the network output for performing the machine learning task.
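For orientation, the following is a minimal sketch, not the patented implementation, of the gated feed-forward sub-layer recited in claim 1, written in PyTorch. The class and attribute names (GatedFeedForward, w1, w2, w_out), the hidden width d_ff, and the final output projection are illustrative assumptions; the claim itself requires only two different learned linear transformations, applied identically at every position, whose outputs are combined by element-wise multiplication to produce the layer output.

    # Minimal sketch of the claimed feed-forward sub-layer, assuming a
    # PyTorch setting. Names and the output projection are illustrative
    # assumptions, not taken from the patent.
    import torch
    import torch.nn as nn

    class GatedFeedForward(nn.Module):
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            # Two different linear transformations, learned during training
            # and shared across all positions in the sequence.
            self.w1 = nn.Linear(d_model, d_ff)   # first linear transformation
            self.w2 = nn.Linear(d_model, d_ff)   # second, different linear transformation
            # Hypothetical projection back to the model width; claim 1 says
            # only that the layer output is generated "from" the product.
            self.w_out = nn.Linear(d_ff, d_model)

        def forward(self, attended: torch.Tensor) -> torch.Tensor:
            # attended: (batch, positions, d_model), the attended input
            # sequence produced by the attention sub-layer.
            first = self.w1(attended)    # first transformed input
            second = self.w2(attended)   # second transformed input
            third = first * second       # element-wise multiplication
            return self.w_out(third)     # layer output at each position

The element-wise product of two plain linear projections corresponds to the bilinear member of the gated-linear-unit (GLU) family; related variants in the GLU literature apply an activation function such as GELU or Swish to one of the two branches before the multiplication.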