US 12,242,818 B2
Sequence modeling using imputation
William Chan, Markham (CA); Chitwan Saharia, Toronto (CA); Geoffrey E. Hinton, Toronto (CA); Mohammad Norouzi, Richmond Hill (CA); and Navdeep Jaitly, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 17/797,872
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Feb. 8, 2021, PCT No. PCT/US2021/017131 § 371(c)(1), (2) Date Aug. 5, 2022, PCT Pub. No. WO2021/159103, PCT Pub. Date Aug. 12, 2021.
Claims priority of provisional application 63/009,970, filed on Apr. 14, 2020.
Claims priority of provisional application 62/971,769, filed on Feb. 7, 2020.
Prior Publication US 2023/0075716 A1, Mar. 9, 2023
Int. Cl. G06F 40/47 (2020.01); G06F 40/284 (2020.01)

CPC G06F 40/47 (2020.01) [G06F 40/284 (2020.01)]

23 Claims

1. A method of generating, from an input sequence having a respective input token at each of a plurality of input positions, an output sequence having a respective output token from a vocabulary of output tokens at each of a plurality of output positions, the method comprising:

receiving the input sequence;

determining a plurality of blocks, wherein each block comprises a plurality of input tokens having consecutive input positions from the input positions, wherein the input tokens comprise a first token modality associated with audio tokens, text tokens, or image tokens;

processing the input sequence using a neural network to generate a latent alignment of the input sequence, wherein the latent alignment comprises, at each of the input positions, either an output token from the vocabulary of output tokens or a blank token, the processing comprising, at each of a plurality of input time steps:

receiving a partial latent alignment from a previous input time step, wherein the partial latent alignment comprises, at each of the input positions, one of: an output token, a blank token, or a mask token;

selecting an input position in each block, wherein the token at the selected input position of the partial latent alignment in each block is a mask token; and

processing i) the partial latent alignment and ii) the input sequence using the neural network to generate a new latent alignment, wherein the new latent alignment comprises, at the selected input position in each block, an output token or a blank token; and

generating, using the latent alignment, the output sequence, wherein the output tokens comprise a second token modality associated with audio tokens, text tokens, or image tokens.
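
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it (`MASK`, `BLANK`, `make_blocks`, `decode`, `collapse`, `toy_predict`) is hypothetical. Three assumptions go beyond the claim text: the neural network is replaced by a toy stand-in predictor, the masked position selected in each block is the one with the highest predictor score, and the output sequence is obtained from the latent alignment by a CTC-style collapse (merge repeated tokens, then drop blanks).

```python
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"    # hypothetical marker for a not-yet-imputed position
BLANK = "<blank>"  # hypothetical marker for the blank token

# A predictor maps (input sequence, partial alignment) to one
# (token, score) proposal per input position. This stands in for the
# neural network of the claim and is an assumption of this sketch.
Predictor = Callable[[Sequence[str], Sequence[str]], List[Tuple[str, float]]]


def make_blocks(num_positions: int, block_size: int) -> List[range]:
    """Split positions 0..num_positions-1 into blocks of consecutive positions."""
    return [
        range(start, min(start + block_size, num_positions))
        for start in range(0, num_positions, block_size)
    ]


def decode(inputs: Sequence[str], block_size: int, predict: Predictor) -> List[str]:
    """Fill one masked position per block at each step until no mask remains."""
    blocks = make_blocks(len(inputs), block_size)
    alignment = [MASK] * len(inputs)  # initial partial latent alignment
    while MASK in alignment:
        proposals = predict(inputs, alignment)
        new_alignment = list(alignment)
        for block in blocks:
            masked = [i for i in block if alignment[i] == MASK]
            if not masked:
                continue  # this block is already fully imputed
            # Assumption: pick the masked position with the highest score.
            best = max(masked, key=lambda i: proposals[i][1])
            new_alignment[best] = proposals[best][0]
        alignment = new_alignment
    return alignment


def collapse(alignment: Sequence[str]) -> List[str]:
    """CTC-style collapse (assumed): merge repeated tokens, then drop blanks."""
    output: List[str] = []
    previous = None
    for token in alignment:
        if token != previous and token != BLANK:
            output.append(token)
        previous = token
    return output


def toy_predict(inputs: Sequence[str], alignment: Sequence[str]) -> List[Tuple[str, float]]:
    """Toy stand-in for the network: '_' maps to blank, other inputs map to themselves."""
    return [
        (BLANK if token == "_" else token, float(len(inputs) - i))
        for i, token in enumerate(inputs)
    ]


frames = ["h", "h", "_", "e", "l", "_", "l", "o"]
latent_alignment = decode(frames, block_size=4, predict=toy_predict)
output_sequence = collapse(latent_alignment)
```

With a block size of 4, the eight input positions form two blocks, and each pass of the loop imputes one position in each block in parallel, so the number of passes depends on the block size rather than on the sequence length.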