US 11,727,264 B2
Reinforcement learning using pseudo-counts
Marc Gendron-Bellemare, London (GB); Remi Munos, London (GB); and Srinivasan Sriram, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Appl. No. 16/303,501
Filed by DEEPMIND TECHNOLOGIES LIMITED, London (GB)
PCT Filed May 18, 2017, PCT No. PCT/US2017/033218
§ 371(c)(1), (2) Date Nov. 20, 2018,
PCT Pub. No. WO2017/201220, PCT Pub. Date Nov. 23, 2017.
Claims priority of provisional application 62/339,778, filed on May 20, 2016.
Prior Publication US 2020/0327405 A1, Oct. 15, 2020
Int. Cl. G06N 3/08 (2023.01); G06F 17/18 (2006.01); G06N 3/047 (2023.01)
CPC G06N 3/08 (2013.01) [G06F 17/18 (2013.01); G06N 3/047 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for training a neural network used to select actions to be performed by an agent interacting with an environment, the computer-implemented method comprising:
obtaining data identifying a first observation characterizing a first state of the environment;
selecting, using the neural network, an action to be performed by the agent in response to the first observation;
controlling the agent to perform the selected action;
receiving an actual reward resulting from the agent performing the action in response to the first observation;
determining a pseudo-count for the first observation using a sequential density model which represents a likelihood that the first observation occurs given a sequence of previous observations, wherein the pseudo-count depends upon a number of previous occurrences of the first observation during the training of the neural network;
determining, from the pseudo-count for the first observation, an exploration reward bonus that incentivizes the agent to explore the environment, wherein the exploration reward bonus is lower when the pseudo-count is higher and vice-versa;
generating a combined reward from the actual reward and the exploration reward bonus; and
training the neural network by adjusting current values of parameters of the neural network using the combined reward.