| CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01)] | 18 Claims |

|
1. A method for generating an output sequence comprising a plurality of output tokens from an input sequence comprising a plurality of input tokens selected from a vocabulary that includes natural language tokens, the method comprising, at each of a plurality of generation time steps:
generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of the generation time step;
processing the combined sequence using a self-attention decoder neural network that comprises a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence to generate a time step output that defines a score distribution over a set of possible output tokens, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated in the output sequence; and
determining an output token using the time step output.
|
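The generation loop recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: `ToyDecoder`, `masked_self_attention`, `causal_mask`, and `generate` are hypothetical names, the weights are random rather than trained, positional information is omitted, and greedy (argmax) selection is assumed as just one possible way of determining an output token from the time step output.

```python
import numpy as np

def causal_mask(n):
    # True where position i may attend to position j, i.e. j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_self_attention(x, wq, wk, wv):
    # Single-head self-attention; masked positions get zero weight.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = np.where(causal_mask(len(x)), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

class ToyDecoder:
    """Hypothetical decoder with random, untrained weights (illustration only)."""

    def __init__(self, vocab_size, dim=16, num_layers=2, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab_size, dim))
        self.layers = [
            tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
            for _ in range(num_layers)
        ]
        self.out = rng.normal(size=(dim, vocab_size))

    def hidden_states(self, tokens):
        x = self.embed[np.asarray(tokens)]
        for wq, wk, wv in self.layers:
            x = x + masked_self_attention(x, wq, wk, wv)  # residual connection
        return x

    def step_scores(self, tokens):
        # Time step output: a score distribution over the vocabulary,
        # read from the last position of the combined sequence.
        logits = self.hidden_states(tokens)[-1] @ self.out
        logits -= logits.max()
        probs = np.exp(logits)
        return probs / probs.sum()

def generate(decoder, input_tokens, num_steps):
    output_tokens = []
    for _ in range(num_steps):
        # Combined sequence: input followed by the outputs generated so far.
        combined = list(input_tokens) + output_tokens
        scores = decoder.step_scores(combined)
        # Assumption: greedy selection; sampling or beam search would also fit.
        output_tokens.append(int(np.argmax(scores)))
    return output_tokens
```

For example, `generate(ToyDecoder(vocab_size=10), [1, 2, 3], 4)` returns four token ids. Because of the mask, the hidden state at any position is unchanged when later tokens are appended, which is the dependency property the claim describes.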