US 11,727,264 B2
Reinforcement learning using pseudo-counts
Marc Gendron-Bellemare, London (GB); Remi Munos, London (GB); and Srinivasan Sriram, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Appl. No. 16/303,501
Filed by DEEPMIND TECHNOLOGIES LIMITED, London (GB)
PCT Filed May 18, 2017, PCT No. PCT/US2017/033218
§ 371(c)(1), (2) Date Nov. 20, 2018,
PCT Pub. No. WO2017/201220, PCT Pub. Date Nov. 23, 2017.
Claims priority of provisional application 62/339,778, filed on May 20, 2016.
Prior Publication US 2020/0327405 A1, Oct. 15, 2020
Int. Cl. G06N 3/08 (2023.01); G06F 17/18 (2006.01); G06N 3/047 (2023.01)
CPC G06N 3/08 (2013.01) [G06F 17/18 (2013.01); G06N 3/047 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for training a neural network used to select actions to be performed by an agent interacting with an environment, the computer-implemented method comprising:
obtaining data identifying a first observation characterizing a first state of the environment;
selecting, using the neural network, an action to be performed by the agent in response to the first observation;
controlling the agent to perform the selected action;
receiving an actual reward resulting from the agent performing the action in response to the first observation;
determining a pseudo-count for the first observation using a sequential density model which represents a likelihood that the first observation occurs given a sequence of previous observations, wherein the pseudo-count depends upon a number of previous occurrences of the first observation during the training of the neural network;
determining, from the pseudo-count for the first observation, an exploration reward bonus that incentivizes the agent to explore the environment, wherein the exploration reward bonus is lower when the pseudo-count is higher and vice-versa;
generating a combined reward from the actual reward and the exploration reward bonus; and
training the neural network by adjusting current values of parameters of the neural network using the combined reward.