| CPC G06F 40/47 (2020.01) [G06F 40/284 (2020.01)] | 23 Claims |

|
1. A method of generating, from an input sequence having a respective input token at each of a plurality of input positions, an output sequence having a respective output token from a vocabulary of output tokens at each of a plurality of output positions, the method comprising:
receiving the input sequence;
determining a plurality of blocks, wherein each block comprises a plurality of input tokens having consecutive input positions from the input positions, wherein the input tokens comprise a first token modality associated with audio tokens, text tokens, or image tokens;
processing the input sequence using a neural network to generate a latent alignment of the input sequence, wherein the latent alignment comprises, at each of the input positions, either an output token from the vocabulary of output tokens or a blank token, the processing comprising, at each of a plurality of input time steps:
receiving a partial latent alignment from a previous input time step, wherein the partial latent alignment comprises, at each of the input positions, one of: an output token, a blank token, or a mask token;
selecting an input position in each block, wherein the token at the selected input position of the partial latent alignment in each block is a mask token; and
processing i) the partial latent alignment and ii) the input sequence using the neural network to generate a new latent alignment, wherein the new latent alignment comprises, at the selected input position in each block, an output token or a blank token; and
generating, using the latent alignment, the output sequence, wherein the output tokens comprise a second token modality associated with audio tokens, text tokens, or image tokens.
|
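The decoding loop recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation, and the following are assumptions introduced only for illustration: the names `make_blocks`, `decode_alignment`, `collapse`, and `toy_model`; the rule of picking the highest-confidence masked position in each block; and the CTC-style collapse (merge repeated tokens, then drop blanks) used to turn the latent alignment into the output sequence. The claim states only that a masked position is selected in each block and that the output sequence is generated using the latent alignment. The neural network is replaced by a deterministic stand-in function.

```python
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"    # position not yet decided
BLANK = "<blank>"  # position decided to emit no output token

Prediction = Tuple[str, float]  # (predicted token, confidence)
Model = Callable[[Sequence[str], Sequence[str]], List[Prediction]]


def make_blocks(num_positions: int, block_size: int) -> List[range]:
    """Split positions 0..num_positions-1 into blocks of consecutive positions."""
    return [range(start, min(start + block_size, num_positions))
            for start in range(0, num_positions, block_size)]


def decode_alignment(inputs: Sequence[str], model: Model, block_size: int) -> List[str]:
    """Fill one masked position per block per step until no mask tokens remain."""
    alignment = [MASK] * len(inputs)
    blocks = make_blocks(len(inputs), block_size)
    for _ in range(block_size):  # block_size steps are enough to fill every block
        # The model sees the partial alignment from the previous step and the input.
        predictions = model(alignment, inputs)
        new_alignment = list(alignment)
        for block in blocks:
            masked = [i for i in block if alignment[i] == MASK]
            if not masked:
                continue  # a shorter final block may already be complete
            # Assumed selection rule: the most confident masked position.
            chosen = max(masked, key=lambda i: predictions[i][1])
            new_alignment[chosen] = predictions[chosen][0]
        alignment = new_alignment
    return alignment


def collapse(alignment: Sequence[str]) -> List[str]:
    """Assumed CTC-style collapse: merge adjacent repeats, then drop blanks."""
    output: List[str] = []
    previous = None
    for token in alignment:
        if token != BLANK and token != previous:
            output.append(token)
        previous = token
    return output


def toy_model(alignment: Sequence[str], inputs: Sequence[str]) -> List[Prediction]:
    """Hypothetical stand-in for the neural network; ignores the partial alignment."""
    return [(BLANK if token == "_" else token.upper(), 1.0 / (1 + i))
            for i, token in enumerate(inputs)]


example_inputs = ["h", "h", "_", "i", "_", "_"]
example_alignment = decode_alignment(example_inputs, toy_model, block_size=3)
print(example_alignment)            # ['H', 'H', '<blank>', 'I', '<blank>', '<blank>']
print(collapse(example_alignment))  # ['H', 'I']
```

With a block size of 3 and six input positions, the sketch needs three passes of the stand-in model regardless of sequence length within a block, because every block has one position resolved in each pass.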