US 11,720,796 B2
Neural episodic control
Benigno Uria-Martínez, London (GB); Alexander Pritzel, London (GB); Charles Blundell, London (GB); and Adrià Puigdomènech Badia, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Apr. 23, 2020, as Appl. No. 16/856,527.
Application 16/856,527 is a continuation of application No. 16/445,523, filed on Jun. 19, 2019, granted, now 10,664,753.
Application 16/445,523 is a continuation of application No. PCT/EP2018/054624, filed on Feb. 26, 2018.
Claims priority of provisional application 62/463,558, filed on Feb. 24, 2017.
Prior Publication US 2020/0265317 A1, Aug. 20, 2020
Int. Cl. G06N 3/084 (2023.01); G06N 3/006 (2023.01); G06N 3/08 (2023.01); G06N 3/04 (2023.01); G06N 3/044 (2023.01)
CPC G06N 3/084 (2013.01) [G06N 3/006 (2013.01); G06N 3/08 (2013.01); G06N 3/04 (2013.01); G06N 3/044 (2023.01)]
20 Claims
 
1. A computer-implemented method for training a neural episodic controller that comprises an embedding neural network, the neural episodic controller maintaining episodic memory data that comprises a respective episodic memory module for each action of a plurality of actions that may be performed by an agent in response to an observation, the episodic memory module for each action mapping each of a respective plurality of key embeddings to a respective return estimate, the method comprising:
sampling, by one or more computers, a training tuple from the replay memory, the training tuple comprising a training observation, a training selected action, and a training return;
processing the training observation using the embedding neural network in accordance with current values of parameters of the embedding neural network to generate a training key embedding for the training observation;
identifying, from the episodic memory data, a respective episodic memory module for the training selected action in the training tuple, wherein the respective episodic memory module (i) includes a first array of vectors with each vector representing a respective key embedding and a second array of vectors with each vector representing a respective return estimate and (ii) maps each key embedding in the first array of vectors to a corresponding return estimate in the second array of vectors, wherein the key embeddings represented by the first array of vectors are key embeddings of observations in response to which the training selected action was performed by the agent, and wherein a return estimate mapped to by a key embedding of a given observation is an estimate of a combination of rewards received after the agent performed the training selected action in response to the given observation;
determining whether the training key embedding associated with the training observation matches any of the key embeddings in the first array of vectors of the episodic memory module for the training selected action;
when the training key embedding matches a key embedding in the first array of vectors of the episodic memory module for the training selected action, updating the episodic memory module using an episodic memory learning rate, wherein updating the episodic memory module comprises mapping the matching key embedding to a new return estimate that is computed based on (i) the training return, (ii) the return estimate currently mapped to by the matching key embedding, and (iii) the episodic memory learning rate;
determining, by the one or more computers, a Q value for the training selected action from the training observation, wherein the Q value for the selected action is a predicted return that would result from the agent performing the training selected action in response to the training observation; and
backpropagating, by the one or more computers, a gradient of an error between the Q value for the training selected action and the training return to update current values of parameters of the embedding neural network using an embedding neural network learning rate that is smaller than the episodic memory learning rate used to update the episodic memory module.
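The steps recited in claim 1 can be illustrated with a minimal sketch. It is only one possible reading under stated assumptions, not the patented implementation. The claim does not fix several details, so the following are assumptions made for illustration: the names `EpisodicMemory`, `embed`, `q_value`, and `train_step` are hypothetical; the embedding network is a single linear layer; the new return estimate is formed by moving the old estimate toward the training return by the episodic memory learning rate; a non-matching key is appended as a new entry; the Q value is an inverse-distance weighted average of stored return estimates; and stored keys are treated as constants during backpropagation.

```python
class EpisodicMemory:
    """One module per action: parallel arrays of keys and return estimates."""

    def __init__(self):
        self.keys = []    # first array: key embeddings
        self.values = []  # second array: return estimates

    def find(self, key, tol=1e-9):
        # Assumption: a "match" means element-wise equality within tol.
        for i, stored in enumerate(self.keys):
            if all(abs(a - b) <= tol for a, b in zip(stored, key)):
                return i
        return None

    def write(self, key, ret, memory_lr):
        i = self.find(key)
        if i is None:
            # Assumption: a non-matching key is stored as a new entry.
            self.keys.append(list(key))
            self.values.append(ret)
        else:
            # Assumed update form: move the old estimate toward the return.
            self.values[i] += memory_lr * (ret - self.values[i])


def embed(W, x):
    # Assumed embedding network: a single linear layer h = W x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def q_value(memory, h, delta=1e-3):
    # Assumed readout: inverse-distance kernel average over stored estimates.
    kernels = [1.0 / (sum((a - b) ** 2 for a, b in zip(h, k)) + delta)
               for k in memory.keys]
    total = sum(kernels)
    q = sum(k * v for k, v in zip(kernels, memory.values)) / total
    return q, kernels, total


def train_step(W, memories, training_tuple, memory_lr=0.5, network_lr=0.01):
    # network_lr is deliberately smaller than memory_lr, as the claim requires.
    x, action, ret = training_tuple
    memory = memories[action]
    h = embed(W, x)
    memory.write(h, ret, memory_lr)
    q, kernels, total = q_value(memory, h)
    # Gradient of q with respect to h; stored keys are held constant.
    dq_dh = [0.0] * len(h)
    for k_i, v_i, key in zip(kernels, memory.values, memory.keys):
        for d in range(len(h)):
            dk = -2.0 * (h[d] - key[d]) * k_i ** 2
            dq_dh[d] += (v_i - q) * dk / total
    # Backpropagate the error of the loss 0.5 * (q - ret)^2 into W (in place).
    err = q - ret
    for d, row in enumerate(W):
        for j in range(len(row)):
            row[j] -= network_lr * err * dq_dh[d] * x[j]
    return q


mem = EpisodicMemory()
mem.write([1.0, 0.0], 10.0, memory_lr=0.5)  # new key, stored as-is
mem.write([1.0, 0.0], 20.0, memory_lr=0.5)  # matching key, estimate updated
print(mem.values[0])  # 15.0
```

In this sketch the memory entry moves halfway toward each new return, while the embedding weights change by a much smaller step, which mirrors the claimed relationship between the two learning rates.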