US 12,106,220 B2
Regularization of recurrent machine-learned architectures with encoder, decoder, and prior distribution
Maksims Volkovs, Toronto (CA); Mathieu Jean Remi Ravaut, Toronto (CA); Kin Kwan Leung, Toronto (CA); and Hamed Sadeghi, Toronto (CA)
Assigned to The Toronto-Dominion Bank, Toronto (CA)
Filed by The Toronto-Dominion Bank, Toronto (CA)
Filed on Jun. 7, 2019, as Appl. No. 16/435,213.
Claims priority of provisional application 62/778,277, filed on Dec. 11, 2018.
Prior Publication US 2020/0184338 A1, Jun. 11, 2020
Int. Cl. G06N 3/084 (2023.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 20/20 (2019.01)
CPC G06N 3/084 (2013.01) [G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 20/20 (2019.01)] 21 Claims
OG exemplary drawing
 
1. A method of training a recurrent machine-learned model having an encoder network, a decoder network, and a transition network, the method comprising:
obtaining a sequence of observations;
for each observation in the sequence, repeatedly performing the steps of:
generating a current latent distribution for a current observation by applying the encoder network to the current observation and values of the encoder network for one or more previous observations, the current latent distribution representing a distribution for a latent state of the current observation given a value of the current observation and a latent state for the one or more previous observations;
generating a prior distribution by inputting a value generated from a previous latent distribution for at least a previous observation directly preceding the current observation in the sequence directly to an input layer of the transition network without inputting the observations and the latent state of the current observation into the transition network, the previous latent distribution generated by applying the encoder network to the previous observation, the prior distribution representing a distribution for the latent state of the current observation given the latent state for the one or more previous observations independent of the value of the current observation;
generating an estimated latent state for the current observation from the current latent distribution;
generating a predicted likelihood for observing a subsequent observation that comes after the current observation in the sequence given the latent state for the current observation by applying the decoder network directly to the estimated latent state for the current observation without inputting the observations into the decoder network; and
determining a loss for the current observation including a combination of a prediction loss and a divergence loss, the prediction loss indicating a difference between the predicted likelihood and the subsequent observation, and the divergence loss indicating a measure of difference between the current latent distribution and the prior distribution;
determining a loss function of the sequence of observations as a combination of the losses for each observation in the sequence; and
backpropagating one or more error terms from the loss function to update parameters of the encoder network, the decoder network, and the transition network.
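
In code, the three networks recited in the claim can be realized in many ways. The sketch below is one illustrative PyTorch reading, not the patented implementation: it assumes diagonal-Gaussian latent distributions, a GRU-based encoder whose hidden state supplies the "values of the encoder network for one or more previous observations," and small multilayer perceptrons for the transition and decoder networks. All class names, layer choices, and dimensions are assumptions.

import torch
import torch.nn as nn

class EncoderNet(nn.Module):
    # Recurrent encoder: maps the current observation x_t and the encoder's
    # own previous values (its GRU hidden state) to the parameters of the
    # current latent distribution q(z_t | x_1..x_t).
    def __init__(self, obs_dim, hidden_dim, latent_dim):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x_t, h_prev):
        h_t = self.rnn(x_t, h_prev)
        return self.mu(h_t), self.logvar(h_t), h_t

class TransitionNet(nn.Module):
    # Prior network: receives only a value generated from the previous latent
    # distribution (no observations, no current latent state) and outputs the
    # prior p(z_t | z_{t-1}).
    def __init__(self, latent_dim, hidden_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z_prev):
        h = self.body(z_prev)
        return self.mu(h), self.logvar(h)

class DecoderNet(nn.Module):
    # Decoder: applied directly to the estimated latent state z_t (no
    # observation input) to predict the subsequent observation x_{t+1};
    # here its output is read as the mean of a unit-variance Gaussian.
    def __init__(self, latent_dim, hidden_dim, obs_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )

    def forward(self, z_t):
        return self.body(z_t)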
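
The per-observation steps of the claim, a posterior from the encoder, a prior from the transition network, a sampled latent state, a next-observation prediction, and a loss combining prediction and divergence terms, can then be sketched as follows under the same assumptions. The estimated latent state is drawn with the standard reparameterization trick, the divergence loss is the closed-form KL divergence between two diagonal Gaussians, and under the unit-variance Gaussian likelihood the squared-error term corresponds (up to constants) to the negative log of the claim's "predicted likelihood." The weighting coefficient beta is an assumption; the claim requires only "a combination" of the two losses.

import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Generate an estimated latent state by sampling the latent distribution.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions.
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1)

def sequence_loss(encoder, transition, decoder, x, h0, z0, beta=1.0):
    # x has shape (T, batch, obs_dim); the loss function of the sequence is
    # the sum of the per-observation losses.
    h, z_prev, total = h0, z0, 0.0
    for t in range(x.size(0) - 1):
        mu_q, logvar_q, h = encoder(x[t], h)   # current latent distribution
        mu_p, logvar_p = transition(z_prev)    # prior from previous latent only
        z_t = reparameterize(mu_q, logvar_q)   # estimated latent state
        x_pred = decoder(z_t)                  # prediction for x_{t+1}
        pred_loss = F.mse_loss(x_pred, x[t + 1], reduction="none").sum(-1)
        div_loss = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
        total = total + (pred_loss + beta * div_loss).mean()
        z_prev = z_t
    return total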
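
A single optimization step then backpropagates error terms from the sequence loss to update the parameters of the encoder, decoder, and transition networks jointly, as in the final step of the claim. The shapes and hyperparameters below are arbitrary placeholders.

T, B, D, H, Z = 20, 8, 16, 64, 10   # sequence length, batch, obs/hidden/latent dims
enc, trans, dec = EncoderNet(D, H, Z), TransitionNet(Z, H), DecoderNet(Z, H, D)
opt = torch.optim.Adam([*enc.parameters(), *trans.parameters(), *dec.parameters()])
x = torch.randn(T, B, D)            # a stand-in sequence of observations
loss = sequence_loss(enc, trans, dec, x, torch.zeros(B, H), torch.zeros(B, Z))
opt.zero_grad()
loss.backward()                     # backpropagate the error terms
opt.step()                          # update all three networks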