US 11,893,483 B2
Attention-based sequence transduction neural networks
Noam M. Shazeer, Palo Alto, CA (US); Aidan Nicholas Gomez, Toronto (CA); Lukasz Mieczyslaw Kaiser, Mountain View, CA (US); Jakob D. Uszkoreit, Portola Valley, CA (US); Llion Owen Jones, San Francisco, CA (US); Niki J. Parmar, Sunnyvale, CA (US); Illia Polosukhin, Mountain View, CA (US); and Ashish Teku Vaswani, San Francisco, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Aug. 7, 2020, as Appl. No. 16/988,547.
Application 16/988,547 is a continuation of application No. 16/932,422, filed on Jul. 17, 2020, granted, now 11,113,602.
Application 16/932,422 is a continuation of application No. 16/559,392, filed on Sep. 3, 2019, granted, now 10,719,764, issued on Jul. 21, 2020.
Application 16/559,392 is a continuation of application No. 16/021,971, filed on Jun. 28, 2018, granted, now 10,452,978, issued on Oct. 22, 2019.
Application 16/021,971 is a continuation of application No. PCT/US2018/034224, filed on May 23, 2018.
Claims priority of provisional application 62/541,594, filed on Aug. 4, 2017.
Claims priority of provisional application 62/510,256, filed on May 23, 2017.
Prior Publication US 2020/0372358 A1, Nov. 26, 2020
Int. Cl. G06N 20/00 (2019.01); G06N 3/08 (2023.01); G06N 3/045 (2023.01); G06N 3/04 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/04 (2013.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement a neural network for generating a network output by processing an input sequence having a respective network input at each of a plurality of input positions, the neural network comprising:
a first neural network comprising a sequence of one or more subnetworks, each subnetwork configured to (i) receive a respective subnetwork input for each of a plurality of preceding input positions that precede a current input position in an ordering of the input positions, and (ii) generate a respective subnetwork output for each preceding input position, and wherein each subnetwork comprises:
a self-attention sub-layer that is configured to receive the respective subnetwork input for each of the plurality of preceding input positions in the ordering of the input positions and, for each particular input position of the preceding input positions:
apply a self-attention mechanism over the subnetwork inputs at the preceding input positions to generate a respective output for the particular input position, wherein applying a self-attention mechanism comprises:
determining a query according to the subnetwork input at the particular input position,
determining keys derived from the subnetwork inputs at the preceding input positions,
determining values derived from the subnetwork inputs at the preceding input positions, and
using the determined query, keys, and values to generate the respective output for the particular input position.
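The self-attention recited in claim 1 can be illustrated with a short, non-authoritative sketch. The Python/NumPy function below shows a single-head masked (causal) scaled dot-product attention: a query is derived from the subnetwork input at each particular position, keys and values are derived from the subnetwork inputs at the preceding positions, and the query, keys, and values are combined to produce an output for each position. The function name, the projection-matrix arguments (w_q, w_k, w_v), and the specific scaled dot-product/softmax formulation are illustrative assumptions, not language from the claims.

import numpy as np

def masked_self_attention(x, w_q, w_k, w_v):
    # Illustrative sketch only; not the patented implementation.
    # x: (num_positions, d_model) subnetwork inputs, in order of input position.
    # w_q, w_k, w_v: (d_model, d_k) projections used to derive queries, keys, values.
    q = x @ w_q   # query derived from the subnetwork input at each particular position
    k = x @ w_k   # keys derived from the subnetwork inputs
    v = x @ w_v   # values derived from the subnetwork inputs
    d_k = q.shape[-1]

    # Scaled dot-product compatibility scores between each query and all keys.
    scores = (q @ k.T) / np.sqrt(d_k)

    # Causal mask (assumed): position i attends only to positions <= i,
    # i.e., the preceding input positions of the claim.
    n = x.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    # Softmax over the keys, then a weighted sum of the values,
    # yields the respective output for each position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v   # (num_positions, d_k)

The scaled dot-product form shown here follows the attention mechanism commonly associated with this patent family's architecture; the claim itself is stated more generally and is not limited to this particular formulation.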