US 12,271,817 B2
Attention-based decoder-only sequence transduction neural networks
Noam M. Shazeer, Palo Alto, CA (US); Lukasz Mieczyslaw Kaiser, San Francisco, CA (US); Etienne Pot, Palo Alto, CA (US); Mohammad Saleh, Santa Clara, CA (US); Ben David Goodrich, San Francisco, CA (US); Peter J. Liu, Santa Clara, CA (US); and Ryan Sepassi, Beverly Hills, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 4, 2024, as Appl. No. 18/403,966.
Application 18/403,966 is a continuation of application No. 18/096,946, filed on Jan. 13, 2023, granted, now 11,886,998.
Application 18/096,946 is a continuation of application No. 16/759,690, granted, now 11,556,786, issued on Jan. 17, 2023, previously published as PCT/US2018/058025, filed on Oct. 29, 2018.
Claims priority of provisional application 62/578,358, filed on Oct. 27, 2017.
Prior Publication US 2024/0256859 A1, Aug. 1, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06N 3/045 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01)] 18 Claims
OG exemplary drawing
 
1. A method for generating an output sequence comprising a plurality of output tokens from an input sequence comprising a plurality of input tokens selected from a vocabulary that includes natural language tokens, the method comprising, at each of a plurality of generation time steps:
generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of the generation time step;
processing the combined sequence using a self-attention decoder neural network that comprises a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence to generate a time step output that defines a score distribution over a set of possible output tokens, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated in the output sequence; and
determining an output token using the time step output.
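For readers who want to see the mechanics of the claimed decoding loop, the following is a minimal, illustrative sketch in Python/NumPy. It is not the patented implementation: the vocabulary size, hidden size, layer count, end-of-sequence token, and greedy argmax selection are all assumptions made for illustration, and the "decoder" is reduced to bare masked self-attention layers without the feed-forward sublayers, layer normalization, or positional encodings a full model would use. What it does reflect from claim 1 is the loop structure: at each generation time step a combined sequence (input tokens followed by the outputs generated so far) is processed by causally masked self-attention layers, a score distribution over the vocabulary is produced, and an output token is determined from that distribution.

    # Illustrative sketch only; shapes, layer counts, and the greedy choice are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    VOCAB_SIZE = 32      # assumed toy vocabulary of token ids
    D_MODEL = 16         # assumed hidden size
    NUM_LAYERS = 2       # assumed number of masked self-attention layers
    END_TOKEN = 0        # assumed end-of-sequence token id
    MAX_STEPS = 8        # assumed cap on generation time steps

    # Toy parameters: embedding table, per-layer attention projections,
    # and an output projection onto the vocabulary (the "time step output").
    embed = rng.normal(size=(VOCAB_SIZE, D_MODEL)) * 0.1
    layers = [
        {name: rng.normal(size=(D_MODEL, D_MODEL)) * 0.1
         for name in ("wq", "wk", "wv", "wo")}
        for _ in range(NUM_LAYERS)
    ]
    w_out = rng.normal(size=(D_MODEL, VOCAB_SIZE)) * 0.1

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def masked_self_attention_decoder(combined_sequence):
        """Processes the combined sequence and returns a score distribution
        over possible output tokens for the current generation time step."""
        x = embed[combined_sequence]                   # (seq_len, d_model)
        seq_len = x.shape[0]
        # Causal mask: position i may attend only to positions <= i, so the
        # time step output cannot depend on tokens after the last one generated.
        mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
        for layer in layers:
            q, k, v = x @ layer["wq"], x @ layer["wk"], x @ layer["wv"]
            scores = q @ k.T / np.sqrt(D_MODEL) + mask
            x = x + softmax(scores) @ v @ layer["wo"]  # residual connection
        # Score distribution over the vocabulary, read at the last position.
        return softmax(x[-1] @ w_out)

    def generate(input_sequence):
        """Runs the generation time steps of claim 1 over a token-id input sequence."""
        output_tokens = []
        for _ in range(MAX_STEPS):
            # Combined sequence: the input sequence followed by the output
            # tokens already generated as of this generation time step.
            combined = np.array(input_sequence + output_tokens, dtype=np.int64)
            distribution = masked_self_attention_decoder(combined)
            # Determine an output token using the time step output (greedy
            # choice here; the claim covers any rule based on the scores).
            token = int(distribution.argmax())
            output_tokens.append(token)
            if token == END_TOKEN:
                break
        return output_tokens

    print(generate([5, 11, 3, 7]))   # e.g. a tokenized source sequence

A usage note on the sketch: because the same decoder consumes the concatenation of source and target tokens, no separate encoder is needed; the causal mask alone enforces the claim's requirement that each time step output depend only on the input sequence and the already-generated outputs.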