US 12,354,005 B2
Attention-based decoder-only sequence transduction neural networks
Noam M. Shazeer, Palo Alto, CA (US); Lukasz Mieczyslaw Kaiser, San Francisco, CA (US); Etienne Pot, Palo Alto, CA (US); Mohammad Saleh, Santa Clara, CA (US); Ben David Goodrich, San Francisco, CA (US); Peter J. Liu, Santa Clara, CA (US); and Ryan Sepassi, Beverly Hills, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 4, 2024, as Appl. No. 18/403,992.
Application 18/403,992 is a continuation of application No. 18/096,946, filed on Jan. 13, 2023, granted, now Pat. No. 11,886,998.
Application 18/096,946 is a continuation of application No. 16/759,690, granted, now Pat. No. 11,556,786, issued on Jan. 17, 2023, previously published as PCT/US2018/058025, filed on Oct. 29, 2018.
Claims priority of provisional application 62/578,358, filed on Oct. 27, 2017.
Prior Publication US 2024/0220796 A1, Jul. 4, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06N 3/045 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01)] 19 Claims
[OG exemplary drawing]
 
1. A system comprising:
a user computer; and
a computer system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising:
receiving, from the user computer, input data specifying an input sequence comprising a plurality of input tokens of a natural language;
at each of a plurality of generation time steps:
generating a combined sequence for the generation time step that includes the input sequence followed by output tokens that have already been generated as of the generation time step;
processing the combined sequence using a self-attention decoder neural network that comprises a plurality of masked self-attention neural network layers and is configured to process the combined sequence to generate a time step output; and
determining a respective output token using the time step output; and
providing, to the user computer, output data specifying an output sequence comprising the output tokens determined for the plurality of generation time steps.
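For orientation, what follows is a minimal, illustrative Python sketch of the generation loop recited in claim 1: at each generation time step, the input tokens and the output tokens generated so far are concatenated into a combined sequence, a causally masked self-attention computation over that sequence produces a time step output, and the next output token is determined from it. This is a sketch under stated assumptions, not the patented implementation: the vocabulary size, dimensions, single layer, random weights, and greedy argmax selection are all hypothetical stand-ins, and positional information, multi-head attention, and feed-forward sublayers are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

VOCAB = 16   # toy vocabulary size (hypothetical)
D = 8        # model width (hypothetical)
EOS = 0      # end-of-sequence token id (hypothetical)

# Random toy parameters; a real decoder would use trained weights and
# a plurality of stacked layers rather than this single one.
W_embed = rng.normal(size=(VOCAB, D)) * 0.1
W_q = rng.normal(size=(D, D)) * 0.1
W_k = rng.normal(size=(D, D)) * 0.1
W_v = rng.normal(size=(D, D)) * 0.1
W_out = rng.normal(size=(D, VOCAB)) * 0.1

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_step(combined):
    """One pass of a single masked self-attention layer over the
    combined sequence; returns logits for the next output token."""
    x = W_embed[combined]                      # (T, D) token embeddings
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(D)              # (T, T) attention scores
    # Causal mask: position i may only attend to positions <= i.
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    attended = softmax(scores) @ v             # (T, D)
    return attended[-1] @ W_out                # logits from last position

def generate(input_tokens, max_steps=10):
    outputs = []
    for _ in range(max_steps):
        # Combined sequence: the input followed by the output tokens
        # that have already been generated as of this time step.
        combined = np.array(input_tokens + outputs)
        logits = decoder_step(combined)        # the "time step output"
        token = int(np.argmax(logits))         # greedy token selection
        if token == EOS:
            break
        outputs.append(token)
    return outputs

print(generate([3, 1, 4, 1, 5]))

The lower-triangular mask built with np.tril is what makes the self-attention "masked" in the sense of the claim: position i can attend only to positions at or before i, so the decoder never conditions on output tokens that have not yet been generated.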